Abstract

We consider two connected aspects of maximum likelihood estimation of the parameter
for high-dimensional discrete graphical models: the existence of the maximum
likelihood estimate (mle) and its computation.

When the data is sparse, there are many zeros in the contingency table and the
maximum likelihood estimate of the parameter may not exist. Fienberg and Rinaldo
(2012) have shown that the mle does not exist if and only if the data vector belongs
to a face of the so-called marginal cone spanned by the rows of the design matrix of
the model. Identifying these faces in high dimension is challenging. In this paper, we
take a local approach: we show that one such face, albeit possibly not the smallest one,
can be identified by looking at a collection of marginal graphical models generated
by induced subgraphs $G_i$, $i = 1, \ldots, k$, of $G$. This is our first contribution.

Our second contribution concerns the composite maximum likelihood estimate. 
When the dimension of the problem is large, estimating the parameters of a given 
graphical model through maximum likelihood is challenging, if not impossible. The 
traditional approach to this problem has been local with the use of composite
likelihood based on local conditional likelihoods. A more recent development is to have
the components of the composite likelihood be marginal likelihoods centred around
each vertex $v$. We first show that the estimates obtained by consensus through local
conditional and marginal likelihoods are identical. We then study the asymptotic
properties of the composite maximum likelihood estimate when both the dimension
of the model and the sample size $N$ go to infinity.

*H. Massam gratefully acknowledges support from NSERC Discovery Grant No A8947. 
1 Introduction

Let $V = \{1, \ldots, p\}$ be a finite index set. We consider $N$ individuals that we classify
according to $p$ criteria or variables $X_v$, $v \in V$. We assume that for each $v \in V$, $X_v$ takes
its values in a finite set $I_v$. Let $G = (V, E)$ be an undirected graph where $E$ is the
set of undirected edges $(i, j) \in V \times V$. We assume that independences and conditional
independences between the variables $X_v$, $v \in V$, are represented by $G$ in the following way:

$$X_{v_1} \perp X_{v_2} \mid X_{V \setminus \{v_1, v_2\}} \quad \text{if } (v_1, v_2) \notin E.$$

Thus the distribution of $X = (X_v, v \in V)$ belongs to a discrete graphical model Markov
with respect to $G$. The data are gathered in a $p$-dimensional contingency table with cells
$i \in I = \prod_{v \in V} I_v$. The parameters of this model are the cell probabilities or, equivalently, the
loglinear parameters $\theta$ in (1.1) below if we write the density of the cell counts under its
natural exponential family form

$$f(t; \theta) = \exp\{\langle \theta, t \rangle - N k(\theta)\}. \qquad (1.1)$$

Here $t$ is a vector of marginal cell counts and $\langle \theta, t \rangle$ denotes the inner product of $t = t(x)$
and the canonical loglinear parameter $\theta$. Discrete loglinear models are widely used in many
scientific areas and two aspects of these models have been the topic of much research. The
first is the existence of the maximum likelihood estimate (henceforth abbreviated mle) of
the parameters. The second is the computation or approximate computation
of the mle. These two aspects are connected, as we shall see below. Our contribution in
this paper is two-fold and concerns these two aspects.

The mle of the parameter is said to exist if all cell probability estimates are strictly 
positive. The reader is referred to Fienberg & Rinaldo (2007) and Fienberg & Rinaldo 
(2012) for a complete list of references on the topic and a most interesting historical 
account of the developments. It has been known for a long time (see Birch, 1963 and 
Haberman, 1974) that the nonexistence of the mle is due to the presence of zeros in the 
contingency table. Zeros can exist even when p is small, but they occur particularly often 
when p is large and the sample size N is relatively small. Whether the mle exists or not 
is determined by the position of the data vector $t$ in the convex hull of the support of
the measure $\mu$ generating the hierarchical loglinear model (1.1). Eriksson et al. (2006)
were the first to express a necessary and sufficient condition for the existence of the mle
in these terms. They showed that the mle exists if and only if the data vector belongs
to the interior of what they call the marginal cone, i.e., the cone $C$ generated by the
convex hull of the support of $\mu$. Fienberg and Rinaldo (2012) showed that this necessary
and sufficient condition is valid for all sampling schemes (Poisson, multinomial or product 
multinomial). The marginal cone C is a polyhedral cone and, thus, in order to determine 
whether the mle exists, one has to identify whether the data vector belongs to one of its 
faces. In the supplementary material of their paper, Fienberg and Rinaldo (2012) gave 
several linear programming algorithms to identify the smallest face of the cone containing 
the data vector. Practically, however, these algorithms cannot be implemented in high 
dimensions. 

Our first contribution in this paper is to provide, for high-dimensional problems, a way
to identify a face of $C$ containing the data vector $t$. To do so, we take a local approach.
We consider a finite collection of subgraphs $G_{A_i}$ of $G$ induced by subsets $A_i$, $i = 1, \ldots, k$,
of $V$. We solve each local problem, that is, we find the smallest face $F_{A_i}$ containing the
vector of marginal cell counts $t_{A_i}$ in the marginal model Markov with respect to $G_{A_i}$. This
face can be extended to a face $F_i$ of $C$ containing $t$. In Theorem 3.1, we show that $\cap_{i=1}^{k} F_i$
is a face of the marginal cone $C$ of the global model containing the data vector $t$. The
face $\cap_{i=1}^{k} F_i$ is not necessarily the smallest one containing $t$, but knowing that $t$ belongs
to $\cap_{i=1}^{k} F_i$ tells us that the mle does not exist and gives us a model of dimension smaller
than the dimension of $\mathcal{M}$ for which we can attempt to evaluate the mle. Of course, this is
true only if $\cap_{i=1}^{k} F_i$ is not equal to the relative interior of $C$. If it is, our procedure yields
no information. In our simulations, if the data belonged to a face of $C$, we never had this
situation. In fact, we found that the smallest face containing $t$ was always equal to the intersection
$\cap_{i=1}^{k} F_i$. This is probably due to the fact that a simulated data vector will rarely fall on a
face of small dimension which would not show up as $\cap_{i=1}^{k} F_i$.

Our second contribution in this paper concerns the composite maximum likelihood 
estimates of $\theta$ in (1.1). The mle of $\theta$ is the value of the parameter that maximizes
the likelihood or, equivalently, the loglikelihood function $l(\theta) = \langle \theta, t \rangle - N k(\theta)$. The log
partition function or cumulant generating function $k(\theta)$ is usually intractable when the
dimension of the model is large. As a consequence, even though the loglikelihood is
a concave function of $\theta$, so that its maximization is a convex optimization problem, the
traditional convex optimization methods using the derivative of
the likelihood cannot be used. Approximate techniques such as variational methods (see
Jordan et al., 1999, Wainwright and Jordan, 2008) or MCMC techniques (see Geyer, 1991, 
Snijders 2002) have been developed in recent years. More recently still, work has been 
done on a third type of approximate techniques based on the maximization of composite 
likelihoods (Lindsay, 1988). For a given graphical model and a given data set $X^{(1)}, \ldots, X^{(N)}$,
a composite likelihood is typically the product of local conditional likelihoods, coming from
the local conditional probability of $X_v$ given $X_{\mathcal{N}_v}$, which we can write as

$$L^{PS}(\theta) = \prod_{v \in V} \prod_{k=1}^{N} p\big(X_v^{(k)} \mid X_{\mathcal{N}_v}^{(k)}\big) \qquad (1.2)$$

where $\mathcal{N}_v$ denotes the set of neighbours of $v$ in $G$. For work using this type of technique
applied to discrete graphical models, the reader may refer to Liu and Ihler (2012) and 
references therein. 

In the last two years, for Gaussian graphical models, Wiesel and Hero (2012) and Meng
et al. (2013, 2014) have introduced composite likelihood estimation where the composite
likelihood is the product of local convex "relaxed" marginal (rather than conditional)
likelihoods coming from $p(X_v, X_{\mathcal{N}_v})$ rather than from $p(X_v \mid X_{\mathcal{N}_v})$. For discrete graphical
models, Mizrahi et al. (2014a) have proposed a similar marginal approach, taking a
clique of $G$ and its neighbourhood. In the papers mentioned above using the composite
likelihood, from either marginal or conditional local likelihoods, the global composite mle
is obtained by "consensus", i.e. by combining the local mle's obtained from the various
$v$-local likelihoods. The value of the global composite mle is therefore a function of the
values of the mle of the $v$-local likelihoods.

In this paper, we extend the marginal approach of Meng et al. (2013, 2014) to discrete
graphical models and give two results. The first, Theorem 4.1, states that the composite
mle obtained through this marginal approach is equal to the composite mle obtained
through the conditional approach. We conclude that there is therefore no point in using
the more computationally complex marginal approach to compute the composite mle.
Our second result, Theorem 5.1, gives the rate of convergence of the composite mle to the
true value $\theta^*$ of the parameter when both $p$ and $N$ go to infinity. In Theorem 5.2, we
give the rate of convergence of the global mle under the same conditions and compare
the two rates of convergence.

The existence and computation of the mle are of course connected: if a local mle does
not exist, the composite mle cannot be computed by consensus and probably does not
exist. An optimization routine may not signal that the mle does not exist and will usually
return a fallacious number for the local mle. In this case, running a convex optimization
routine on each local problem without looking at the details of each maximization may
lead to an inaccurate estimate (and to the erroneous belief that the two composite mle,
obtained from local marginal and conditional likelihoods, are different).

The remainder of this paper is organized as follows. In the next section, we introduce 
our notation for the discrete graphical model. In Section 3, we prove our result on the 
identification of a face containing the data vector. In Section 4, we study the local marginal
likelihood estimator and its relationship to the local conditional likelihood estimator. In
Section 5, we give its asymptotic properties. The proofs of the theorems are given in the
Appendix.

2 Preliminaries 

2.1 Discrete graphical and hierarchical loglinear models 

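Let $p$, $V$ and $X = (X_v, v \in V)$ be as described in Section 1 above. If $N$ individuals are
classified according to the $p$ criteria, the resulting counts are gathered in a contingency
table such that

$$I = \prod_{v \in V} I_v$$

is the set of cells $i = (i_v, v \in V)$. For $D \subset V$, $i_D$ denotes the marginal cell $i_D = (i_v, v \in D)$
with $i_v \in I_v$. Let $\mathcal{D}$ be a family of non-empty subsets of $V$ such that $D \in \mathcal{D}$, $D_1 \subset D$ and
$D_1 \neq \emptyset$ implies $D_1 \in \mathcal{D}$. In order to avoid trivialities we assume $\cup_{D \in \mathcal{D}} D = V$. The family
$\mathcal{D}$ is called the generating class of the hierarchical loglinear model. We denote by $\Omega_{\mathcal{D}}$ the
linear subspace of $y \in \mathbb{R}^I$ such that there exist functions $\theta_D \in \mathbb{R}^I$ for $D \in \mathcal{D}$ depending
only on $i_D$ and such that $y = \sum_{D \in \mathcal{D}} \theta_D$, that is

$$\Omega_{\mathcal{D}} = \Big\{y \in \mathbb{R}^I : \exists\, \theta_D \in \mathbb{R}^I,\ D \in \mathcal{D}, \text{ such that } \theta_D(i) = \theta_D(i_D) \text{ and } y = \sum_{D \in \mathcal{D}} \theta_D\Big\}.$$

The hierarchical model generated by $\mathcal{D}$ is the set of probabilities $p = (p(i))_{i \in I}$ on $I$ such
that $p(i) > 0$ for all $i$ and such that $\log p \in \Omega_{\mathcal{D}}$.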

The class of discrete graphical models Markov with respect to an undirected graph $G$
is a subclass of the class of hierarchical discrete loglinear models. Indeed, let $G = (V, E)$
be an undirected graph where $V$ is the set of vertices and $E \subset V \times V$ denotes the set
of undirected edges. We say that the distribution of $X$ is Markov with respect to $G$ if
$(v_1, v_2) \notin E$ implies

$$X_{v_1} \perp X_{v_2} \mid X_{V \setminus \{v_1, v_2\}}.$$

Let $\mathcal{D}$ be the set of all cliques (not necessarily maximal) of the graph $G$. If the distribution
of $X = (X_1, \ldots, X_p)$ is Multinomial$(1, p(i), i \in I)$ Markov with respect to the graph
$G$, and if we assume that all $p(i)$, $i \in I$, are positive, then, by the Hammersley-Clifford
theorem, $\log p(i)$ is a linear function of parameters dependent on the marginal cells $i_D$, $D \in \mathcal{D}$,
only, and therefore the graphical model is a hierarchical loglinear model with generating
set the set $\mathcal{D}$ of cliques of $G$. The reader is referred to Darroch & Speed (1983), Lauritzen
(1996) or Letac & Massam (2012) for a detailed description of the hierarchical loglinear
model and the subclass of discrete graphical loglinear models.

We now set our notation and recall some basic results for discrete hierarchical loglinear
models. The following notation and results can be found in Letac & Massam (2012) and
the corresponding supplementary file.

Among all the values that $X_v$ can take in $I_v$, $v \in V$, we call one of them 0. For a cell
$i \in I$, we define its support $S(i)$ as

$$S(i) = \{v \in V : i_v \neq 0\}$$

and we define also the following subset $J$ of $I$:

$$J = \{j \in I,\ S(j) \in \mathcal{D}\}. \qquad (2.1)$$

From here on, we will call this set the $J$-set of the model. For $i \in I$ and $j \in J$, we define
the symbol

$$j \triangleleft i$$

to mean that $S(j)$ is contained in $S(i)$ and that $j_{S(j)} = i_{S(j)}$. The relation $\triangleleft$ has the
property that if $j, j' \in J$ and $i \in I$, then

$$j \triangleleft j' \text{ and } j' \triangleleft i \ \Rightarrow\ j \triangleleft i.$$


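The loglinear parametrization that we use for the multinomial is the so-called baseline
parametrization with general expression, for $i \in I$, $S(i) = E \subset V$,

$$\theta_i = \sum_{F \subseteq E} (-1)^{|E \setminus F|} \log p(i_F, 0_{V \setminus F}). \qquad (2.2)$$

With the notation above, in Proposition 2.1 of Letac and Massam (2012), it is shown that
for $i \notin J$, $\theta_i = 0$ and that

$$\log p(i) = \theta_0 + \sum_{j \in J,\ j \triangleleft i} \theta_j, \quad i \in I, \qquad (2.3)$$

$$\theta_j = \sum_{j' \in J,\ j' \triangleleft j} (-1)^{|S(j)| - |S(j')|} \log \frac{p(j')}{p(0)}, \quad j \in J, \qquad (2.4)$$

with $\theta_0 = \log p(0)$.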


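One then readily derives the density of the multinomial $\mathcal{M}(N, p(i), i \in I)$ of the cell counts
$n = (n(i), i \in I)$, Markov with respect to $G$, to be, up to a multiplicative constant, equal
to

$$f(t; \theta) = \exp\{\langle t, \theta \rangle - N k(\theta)\}, \quad \theta \in \mathbb{R}^J, \qquad (2.5)$$

with $\theta = (\theta_j, j \in J)$, $t = t(n) = (t(j), j \in J)$, where $t(j) = n(j_{S(j)})$ are the $j_{S(j)}$-marginal
cell counts and

$$k(\theta) = \log\Big(\sum_{i \in I} \exp \sum_{j \in J,\ j \triangleleft i} \theta_j\Big) = \log\Big(1 + \sum_{i \in I \setminus \{0\}} \exp \sum_{j \in J,\ j \triangleleft i} \theta_j\Big). \qquad (2.6)$$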
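As a minimal illustration of (2.6), assume two binary variables joined by a single edge, so that $\mathcal{D} = \{\{1\}, \{2\}, \{1,2\}\}$ and $J = \{(1,0), (0,1), (1,1)\}$. This toy model is our own choice for illustration. Enumerating the four cells and collecting, for each cell $i$, the $j \in J$ with $j \triangleleft i$ gives

```latex
\begin{align*}
k(\theta) &= \log\bigl(1 + e^{\theta_{(1,0)}} + e^{\theta_{(0,1)}}
             + e^{\theta_{(1,0)} + \theta_{(0,1)} + \theta_{(1,1)}}\bigr),\\
t &= \bigl(t(1,0),\, t(0,1),\, t(1,1)\bigr)
   = \bigl(n(1,0) + n(1,1),\; n(0,1) + n(1,1),\; n(1,1)\bigr).
\end{align*}
```

The sum inside the logarithm has $|I| = 2^p$ terms in general, which is why $k(\theta)$ becomes intractable as $p$ grows.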


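For $\theta \in \mathbb{R}^J$, these distributions form a natural exponential family of dimension $|J|$ generated
by a measure $\mu$ which we will now identify. Let $e_j$, $j \in J$, be the canonical basis of $\mathbb{R}^J$
and, for $i \in I$, let

$$f_i = \sum_{j \in J,\ j \triangleleft i} e_j.$$

Then (2.3) and (2.4) can be written in matrix form as

$$\log p = A \tilde{\theta} \qquad (2.7)$$

where $\tilde{\theta}^t = (\theta_0, \theta^t)$ and $A$ is an $|I| \times (1 + |J|)$ matrix. We call $A$ the design matrix of the
model. The rows of $A$ are indexed by $i \in I$ and equal to $\tilde{f}_i = (1, f_i^t) \in \mathbb{R}^{1 + |J|}$. It is
immediate to see that the Laplace transform of the generating measure $\mu$ is

$$\int e^{\langle \theta, x \rangle} \mu(dx) = \sum_{i \in I} e^{\langle \theta, f_i \rangle}$$

and therefore the measure $\mu$ generating (2.5) is

$$\mu(dx) = \sum_{i \in I} \delta_{f_i}(dx). \qquad (2.8)$$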


This exponential family is concentrated on the convex hull of $f_i$, $i \in I$, which is a bounded
set of $\mathbb{R}^J$, and therefore the set of parameters $\theta$ for which the Laplace transform is finite
is the whole space $\mathbb{R}^J$. From the definition of $A$, $f_i$, $i \in I$, and $t = (t(j), j \in J)$, it is easy to see that
$(N, t(j), j \in J)^t = A^t n = \sum_{i \in I} n(i) (1, f_i^t)^t$ and the vector of sufficient statistics $t$, which we
also write as $t = t_J$ to emphasize its length, is such that

$$\frac{t_J}{N} = \sum_{i \in I \setminus \{0\}} \frac{n(i)}{N} f_i = \sum_{i \in I} \frac{n(i)}{N} f_i \qquad (2.9)$$

belongs to the convex hull of $(f_i)_{i \in I}$. The $f_i$'s are the extreme points of the closure of
the convex hull of the $f_i$, $i \in I$.
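The identity $t = \sum_{i \in I} n(i) f_i$ can be checked numerically on a toy model. The sketch below assumes binary data, so that each $j \in J$ can be identified with its support, and uses a hypothetical three-vertex chain; the function name `binary_design` and the simulated sample are ours, not part of the paper.

```python
from itertools import product
import numpy as np

def binary_design(p, cliques):
    """Cells of a binary table on p variables and the matrix whose rows are f_i.

    For binary data each j in J is identified with its support S(j) (a clique),
    and j <| i holds exactly when every vertex of S(j) has i_v = 1.
    """
    cells = list(product((0, 1), repeat=p))
    F = np.array([[int(all(i[v] for v in D)) for D in cliques] for i in cells])
    return cells, F

# Hypothetical example: chain 0 - 1 - 2, cliques are the vertices and the edges.
cliques = [(0,), (1,), (2,), (0, 1), (1, 2)]
cells, F = binary_design(3, cliques)

# Simulated sample of N = 50 observations (illustrative data only).
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(50, 3))

# Cell counts n(i).
index = {c: r for r, c in enumerate(cells)}
n = np.zeros(len(cells), dtype=int)
for row in sample:
    n[index[tuple(int(x) for x in row)]] += 1

# Sufficient statistic computed two ways: as sum_i n(i) f_i and as marginal counts.
t_from_cells = F.T @ n
t_from_margins = np.array(
    [int(sample[:, list(D)].all(axis=1).sum()) for D in cliques]
)
```

Both vectors agree, which is the content of (2.9) before division by $N$.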

Fienberg and Rinaldo (2012), Theorem 3, show that the mle of $\theta$ in (2.5) exists and is
unique if and only if the canonical statistic vector $t$ belongs to the relative interior of the
cone $C$ with apex $f_0$ and generated by $f_i$, $i \in I \setminus \{0\}$. Therefore, if the mle does not exist,
the data vector $t$ must belong to one of the facets of $C$. Following Eriksson et al. (2006)
and Fienberg and Rinaldo (2012), we will call $C$ the marginal cone of the model. The
reader is referred to Letac and Massam (2012) for examples of the notions given above.

3 Faces containing the data vector t 

3.1 Finding the smallest face containing t 

Let $C$ be a closed convex polyhedral cone. A set $F \subset C$ is said to be a face of $C$ if for all
$x \in F$, any decomposition of the form $x = y + z$ with $y, z \in C$ implies $y, z \in F$. Given
$g \in \mathbb{R}^J$, the inequality $\langle g, x \rangle \geq 0$ is said to be valid for $C$ if it holds for every $x \in C$.
Then the set

$$F_g = \{x \in C : \langle g, x \rangle = 0\}$$

is called the face of $C$ governed by $g$. Every face of $C$ arises in this manner. There is only
one face of dimension 0 and that is $\{f_0\}$. The faces of dimension 1 are the extreme rays
$\{\lambda f_i, \lambda > 0\}$ for each $f_i$, $i \in I \setminus \{0\}$. Since a convex set is the convex hull of its extreme
points, a face $F$ of $C$ will be defined by a set

$$f_i, \quad i \in \mathcal{F} \subset I.$$

To identify the smallest face $F$ of $C$ containing the data vector $t \in \mathbb{R}^J$, we will have to
identify the subset $\mathcal{F}$ of $I$ defining that face or equivalently we will have to identify a
$g \in \mathbb{R}^J$ such that

$$\langle g, f_i \rangle = 0, \quad \forall i \in \mathcal{F},$$
$$\langle g, f_i \rangle > 0, \quad \forall i \in I \setminus \mathcal{F}.$$

Let $I_+$ be the subset of $I$ with strictly positive cell counts $n(i) > 0$. We have the following
lemma which shows that $\mathcal{F} \supset I_+$.

Lemma 3.1 The sufficient statistic $t$ belongs to the face $F_g$ of $C$ governed by $g$ if and
only if $f_i \in F_g$ for all $i \in I_+$.

The proof is obvious if we write that $t \in F_g \Leftrightarrow \langle t, g \rangle = 0 \Leftrightarrow \sum_{i \in I_+} n(i) \langle f_i, g \rangle = 0
\Leftrightarrow \langle f_i, g \rangle = 0\ \forall i \in I_+$, where the last equivalence holds because every $n(i)$, $i \in I_+$, is
strictly positive and every $\langle f_i, g \rangle$ is nonnegative. The face $F_g$ may contain additional $f_i$'s. Recall that the design
matrix $A$ given in (2.7) has rows $(1, f_i^t)$. Let $B$ be the $|I| \times |J|$ matrix with rows $f_i^t$, $i \in I$.
Let $B_+$ be the matrix obtained from $B$ by keeping the rows indexed by $I_+$ only, and let $B_0$
be the matrix obtained by keeping the rows in $I \setminus I_+$ only. Fienberg and Rinaldo (2012)
showed that $\mathcal{F} \setminus I_+$ could be identified by solving the linear program

$$\max \|B_0 g\|_0 \qquad (3.1)$$
$$\text{s.t. } B_+ g = 0,$$
$$B_0 g \geq 0,$$

where $\|\cdot\|_0$ is the zero norm, that is, the number of nonzero components of its argument. This is a non-convex program which is, however,
easily solved by a sequence of linear programming relaxations (see programs (4) and (6)
in Fienberg and Rinaldo (2012), supplementary material). We adopted their approach
to find the smallest face containing the data vector $t$ whenever the dimension is small
enough for the program to run. We found that we could use this program only for models
with up to 16 vertices.
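For small models, program (3.1) can be sketched as a single linear program. The relaxation below is one standard way to maximise the support of $B_0 g$ over a cone and is not claimed to be programs (4) and (6) of Fienberg and Rinaldo (2012). It adds slack variables $y \in [0, 1]$ with $y \leq B_0 g$ and maximises $\sum y$. The function name `facial_set` and the use of SciPy's `linprog` are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def facial_set(B, counts):
    """Boolean mask over cells: True for cells whose f_i lies on the smallest
    face containing t, under the LP relaxation described above (a sketch)."""
    B = np.asarray(B, dtype=float)
    counts = np.asarray(counts)
    pos = counts > 0
    B_plus, B_zero = B[pos], B[~pos]
    m0, d = B_zero.shape
    if m0 == 0:                       # no zero cell: t is in the relative interior
        return np.ones(len(counts), dtype=bool)
    # Variables (g, y). Maximise sum(y), i.e. minimise -sum(y).
    c = np.concatenate([np.zeros(d), -np.ones(m0)])
    A_eq = np.hstack([B_plus, np.zeros((B_plus.shape[0], m0))])   # B_+ g = 0
    b_eq = np.zeros(B_plus.shape[0])
    A_ub = np.hstack([-B_zero, np.eye(m0)])                       # y - B_0 g <= 0
    b_ub = np.zeros(m0)
    bounds = [(None, None)] * d + [(0.0, 1.0)] * m0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    in_face = pos.copy()
    in_face[~pos] = res.x[d:] < 0.5   # y_i = 0 means <g, f_i> = 0 is forced
    return in_face

# Two binary variables, cells ordered (0,0), (1,0), (0,1), (1,1); n(1,1) = 0.
counts = [2, 1, 3, 0]
B_saturated = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]]   # J = {10, 01, 11}
B_independence = [[0, 0], [1, 0], [0, 1], [1, 1]]            # J = {10, 01}
mask_saturated = facial_set(B_saturated, counts)
mask_independence = facial_set(B_independence, counts)
```

In this toy example the mle fails to exist for the saturated model (the zero cell is excluded from the facial set) but exists for the independence model, since the same zero cell is then compatible with a vector $t$ in the interior of the cone.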

3.2 Splitting the global model into smaller models 

Since it is difficult or impossible to implement the program (3.1) in high dimension, we
now consider a collection of smaller models and we show that the combination of these
smaller models yields information on the global model.

Suppose that $X$ follows a model $\mathcal{M}$ Markov with respect to the undirected graph
$G = (V, E)$. Let $A \subset V$ and let $G_A$ be the graph induced by $A$. Let $\mathcal{M}_A$ be the model
Markov with respect to $G_A$; this is also a multinomial model. The generating set $\mathcal{D}_A$ of
$\mathcal{M}_A$ is a subset of $\mathcal{D}$. Let

$$J_A = \{j \in J,\ S(j) \in \mathcal{D}_A\}. \qquad (3.2)$$

Then $t_{J_A} = (t(j) = n(j_{S(j)}), j \in J_A)$ is the canonical data vector of $\mathcal{M}_A$. Let $C_A$ be the
marginal cone of $\mathcal{M}_A$. Clearly $C_A$ is generated by $f_{i_A}$, $i_A \in I_A$, where $f_{i_A} = (f_{i,j}, j \in J_A)$
is made up of the components of $f_i$, $i \in I$, that are indexed by $J_A$. We may have identical $f_{i_A}$
coming from different $f_i$'s but clearly we keep only one copy. The following lemma shows
how to extend a face of $C_A$ into a face of $C$.

Lemma 3.2 For $g_A \in \mathbb{R}^{J_A}$, let $F_{g_A} = \{x \in C_A : \langle g_A, x \rangle = 0\}$ be a face of $C_A$ containing
$t_{J_A}$. Let $g_1 = (g_A, 0, \ldots, 0) \in \mathbb{R}^J$ be the vector of $\mathbb{R}^J$ obtained from $g_A$ by setting the
remaining variables to 0. Then

$$F_{g_1} = \{x \in C : \langle g_1, x \rangle = 0\}$$

is a face of $C$ containing $t$.
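Proof. Let $\mathcal{F}_A$ denote the index set of the $f_{i_A}$ defining $F_{g_A}$. We have

$$\langle g_A, f_{i_A} \rangle = 0, \quad i_A \in \mathcal{F}_A,$$
$$\langle g_A, f_{i_A} \rangle > 0, \quad i_A \notin \mathcal{F}_A,$$
$$\langle g_A, t_{J_A} \rangle = 0.$$

Writing $i \in I$ as $i = (i_A, i_{A^c})$ and $f_i = (f_{i,J_A}, f_{i,J_A^c})$, we have

$$\langle g_1, f_i \rangle = \langle g_A, f_{i,J_A} \rangle + \langle 0, f_{i,J_A^c} \rangle = 0, \quad i \text{ such that } i_A \in \mathcal{F}_A,$$
$$\langle g_1, f_i \rangle = \langle g_A, f_{i,J_A} \rangle + \langle 0, f_{i,J_A^c} \rangle > 0, \quad i \text{ such that } i_A \notin \mathcal{F}_A,$$
$$\langle g_1, t \rangle = \langle g_A, t_{J_A} \rangle + \langle 0, t_{J_A^c} \rangle = 0.$$

It follows immediately that $F_{g_1}$ is a face of $C$ containing $t$ and it is defined by $f_i$, $i_A \in \mathcal{F}_A$.

□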

We now use this lemma for a finite number of smaller models and in doing so, obtain an
even smaller face containing the data vector $t$. Let $\mathcal{M}$ be the model Markov with respect
to $G = (V, E)$. Let $G_{A_l}$, $l = 1, \ldots, k$, be a collection of subgraphs induced by subsets
$A_l$, $l = 1, \ldots, k$, of $V$. Let $\mathcal{D}_{A_l}$ be the generating set for the model Markov with
respect to $G_{A_l}$ and let $t_{J_{A_l}}$ be its canonical statistic.

Theorem 3.1 Let $A_l$, $l = 1, \ldots, k$, be a collection of subsets of $V$. Let $g_{A_l}$ define faces
of $C_{A_l}$ containing $t_{J_{A_l}}$, $l = 1, \ldots, k$, respectively, let $g_l$ be the vectors of $\mathbb{R}^J$ obtained by
completing $g_{A_l}$ with zeros and let $F_l = F_{g_l}$ be the corresponding faces. Then

$$g = \sum_{l=1}^{k} g_l$$

defines a face $\cap_{l=1}^{k} F_l$ of $C$ containing $t$.

Proof. We give the proof for $k = 2$. If $f_i \in F_1 \cap F_2$, we have that $\langle f_i, g_1 \rangle = \langle f_i, g_2 \rangle = 0$
and therefore $\langle f_i, g \rangle = 0$. Moreover, if $f_i \notin F_1 \cap F_2$, either $\langle f_i, g_1 \rangle > 0$ or $\langle f_i, g_2 \rangle > 0$, and
since both inner products are nonnegative, $\langle f_i, g \rangle > 0$. It follows that $F_g = \{x \in C : \langle g, x \rangle = 0\}$ is a face of $C$. Moreover,
$\langle g, t \rangle = \langle g_1, t \rangle + \langle g_2, t \rangle = 0$ and therefore $t \in F_g$, which clearly is equal to $F_1 \cap F_2$. □
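The zero-padding of Lemma 3.2 and the sum of Theorem 3.1 can be sketched in a few lines. The example assumes a hypothetical model of three independent binary variables, for which $f_i$ coincides with the cell $i$ itself, and two hand-picked local vectors $g_{A_1}$, $g_{A_2}$. The helper names `pad` and `facial_index` are ours.

```python
from itertools import product
import numpy as np

def pad(g_local, positions, dim):
    """Complete a local vector g_A with zeros to a vector of R^J (Lemma 3.2).
    `positions` gives the coordinates of J_A inside J."""
    g = np.zeros(dim)
    g[list(positions)] = g_local
    return g

def facial_index(F, g, tol=1e-9):
    """Indices i of the rows f_i of F with <g, f_i> = 0."""
    return {r for r, f in enumerate(F) if abs(float(f @ g)) <= tol}

# Toy model: three independent binary variables, J = {100, 010, 001}, f_i = i.
F = np.array(list(product((0, 1), repeat=3)), dtype=float)

# Local problems on A_1 = {0, 1} and A_2 = {1, 2}; suppose X_0 = 1 and X_2 = 1
# are never observed, so the local faces are governed by (1, 0) and (0, 1).
g1 = pad([1.0, 0.0], positions=(0, 1), dim=3)
g2 = pad([0.0, 1.0], positions=(1, 2), dim=3)
g = g1 + g2                                   # Theorem 3.1

face_1 = facial_index(F, g1)
face_2 = facial_index(F, g2)
face_g = facial_index(F, g)
```

Here `face_g` equals the intersection of `face_1` and `face_2`, namely the two cells with $i_0 = i_2 = 0$.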

Remark 3.1 In practice, when applying the theorem above, we choose the collection $A_l$
so that $V = \cup_{l=1}^{k} A_l$ with $A_l$, $l = 1, \ldots, k$, as large as possible but small enough that we
can compute the smallest face containing $t_{J_{A_l}}$. In experiments similar to that presented in
Section 3.3 below, we nearly always found that $\cap_{l=1}^{k} F_l$ was the smallest face containing $t$.
3.3 A numerical experiment

We illustrate our results with a numerical experiment. We consider the graphical model
with $G$ equal to the four-neighbour $4 \times 4$ lattice with binary data so that $|I| = 2^{16}$ and
$|J| = 40$. We generated the data vector

t = [1 2754527732 10 281500114141312207311221520 0]. 

We let $A_i$, $i = 1, 2, 3$, be the subsets of $V$ comprising the vertices in the $i$-th and $(i + 1)$-th
row of the lattice. We find that the face $F_{A_1}$ of $C_{A_1}$ containing $t_{J_{A_1}}$ is of dimension 15
and the expanded face $F_1$ is therefore of dimension $15 + 22 = 37$. The corresponding $g_1$ is

$g_1$ = 10 (0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, -1).

The face $F_{A_2}$ is of dimension 13 while $F_2$ is of dimension $13 + 22 = 35$. The corresponding
vector is

$g_2$ = 10 (0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, -1, 1, -1, -1, 0, -1).

The face $F_{A_3}$ is of dimension 11 and $F_3$ is of dimension $11 + 22 = 33$ with

$g_3$ = 10 (0, 1, 1, 0, 1, 0, 1, 1, -1, 0, 0, 0, -1, -1, -1, -1, 1, 1).

In this particular case, after computing the facial set of $F$, the smallest face of the global
model containing $t$, we find that actually $F = F_1 \cap F_2 \cap F_3$ which is of dimension 28.
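The dimension bookkeeping of this experiment can be checked with a short script. It assumes binary data on a triangle-free graph, so that $|J|$ is the number of vertices plus the number of edges, and it takes the local face dimensions 15, 13 and 11 from the text as given. The helper names are ours.

```python
def lattice_edges(rows, cols):
    """Edges of the four-neighbour rows x cols lattice; vertices are (r, c) pairs."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))
    return edges

def j_size(vertices, edges):
    """|J| for binary data on a triangle-free graph: one parameter per vertex
    and one per edge of the induced subgraph."""
    vs = set(vertices)
    return len(vs) + sum(1 for a, b in edges if a in vs and b in vs)

edges = lattice_edges(4, 4)
all_vertices = [(r, c) for r in range(4) for c in range(4)]
size_J = j_size(all_vertices, edges)              # 16 vertices + 24 edges

# A_i = rows i and i+1 of the lattice (0-based here), i = 0, 1, 2.
local_sizes = [
    j_size([(r, c) for r in (i, i + 1) for c in range(4)], edges) for i in range(3)
]
padding = [size_J - s for s in local_sizes]       # coordinates set to zero in g_l

local_dims = [15, 13, 11]                         # local face dimensions from the text
expanded_dims = [d + q for d, q in zip(local_dims, padding)]
```

Each local model has $|J_{A_i}| = 18$ parameters, matching the length of the vectors $g_i$ above, and the 22 remaining coordinates account for the jump from the local to the expanded face dimension.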

4 The conditional and marginal composite maximum 
likelihood estimators 

When the dimension of the discrete graphical model is large, computing the maximum
likelihood estimate of $\theta$ in (2.5) is challenging, if not impossible. As mentioned in the
introduction, a recent approach to this problem has been local with the use of a composite
likelihood which is equal to the product, over all vertices $v \in V$, of the local conditional
likelihood for $X_v$ given $X_{\mathcal{N}_v}$, where $\mathcal{N}_v$ denotes the set of neighbours of $v$ in $G$. Recently,
for Gaussian high-dimensional graphical models, Wiesel & Hero (2012) and Meng et al.
(2013, 2014) worked with a different composite likelihood which is the product, over all
vertices $v \in V$, of local marginal likelihoods. In this section, we will first recall the
definition of the conditional composite likelihood estimate, then extend the marginal
composite likelihood to discrete graphical models and finally show that the maximum
likelihood estimates obtained from these two types, conditional and marginal, of local
models are in fact identical and thus the composite likelihood estimates obtained by consensus
from these two types of likelihood are equal. Since the computational complexity of the
marginal computations is exponential in the number of vertices in the neighbourhood of
$v$ while the conditional computations are linear in this number, there is no advantage in
working with marginal composite likelihoods.

4.1 The conditional composite likelihood function 

We first define the standard conditional composite likelihood function. For $i^{(k)} = (i^{(k)}_v, v \in V)$,
let $X^{(1)} = i^{(1)}, \ldots, X^{(N)} = i^{(N)}$ be a sample of size $N$ from the distribution of $X$ Markov with respect to
$G$. We recall that the global likelihood function is

$$L(\theta) \propto \prod_{k=1}^{N} p(X_v = i^{(k)}_v, v \in V) = \exp\{\langle \theta, t \rangle - N k(\theta)\} \qquad (4.1)$$

where $k(\theta)$ is as in (2.6).

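For a given vertex $v \in V$, let $\mathcal{N}_v$ be the set of neighbours of $v$ in the given graph $G$.
The composite likelihood function based on the local conditional distribution of $X_v$ given
$X_{V \setminus \{v\}}$ or equivalently, due to the Markov property, the conditional distribution of $X_v$
given its neighbours $X_{\mathcal{N}_v}$, is $L^{PS}(\theta) = \prod_{v \in V} L^{PS,v}(\theta)$ where

$$L^{PS,v}(\theta) = \prod_{k=1}^{N} p(X_v = i^{(k)}_v \mid X_{\mathcal{N}_v} = i^{(k)}_{\mathcal{N}_v}) \qquad (4.2)$$

and the superscript PS stands for "pseudo-likelihood", the name often given to the
conditional composite likelihood (Besag, 1974). As given by (2.3), for a given cell $i$, we have

$$\log p(i) = \log p(X_v = i_v, v \in V) = \theta_0 + \sum_{j \triangleleft i} \theta_j$$

$$= \theta_0 + \sum_{j \triangleleft i,\ S(j) \subset v \cup \mathcal{N}_v,\ S(j) \not\subset \mathcal{N}_v} \theta_j + \sum_{j \triangleleft i,\ S(j) \subset \mathcal{N}_v} \theta_j + \sum_{j \triangleleft i,\ S(j) \not\subset v \cup \mathcal{N}_v} \theta_j.$$

The set $J$ is as defined in (2.1) for the global model. Let

$$J^{PS,v} = \{j \in J \mid S(j) \subset v \cup \mathcal{N}_v,\ S(j) \not\subset \mathcal{N}_v\} = \{j \in J \mid v \in S(j)\}.$$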
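Then for $i_v \neq 0$, we have

$$p(X_v = i_v \mid X_{\mathcal{N}_v} = i_{\mathcal{N}_v}) = p(X_v = i_v \mid X_{V \setminus \{v\}} = i_{V \setminus \{v\}}) = \frac{p(X = i)}{p(X_{V \setminus \{v\}} = i_{V \setminus \{v\}})}$$

$$= \frac{\exp\Big(\theta_0 + \sum_{j \triangleleft i,\ j \in J^{PS,v}} \theta_j + \sum_{j \triangleleft i,\ S(j) \subset \mathcal{N}_v} \theta_j + \sum_{j \triangleleft i,\ S(j) \not\subset v \cup \mathcal{N}_v} \theta_j\Big)}{\sum_{k \in I \mid k_{V \setminus \{v\}} = i_{V \setminus \{v\}}} \exp\Big(\theta_0 + \sum_{j \triangleleft k,\ j \in J^{PS,v}} \theta_j + \sum_{j \triangleleft k,\ S(j) \subset \mathcal{N}_v} \theta_j + \sum_{j \triangleleft k,\ S(j) \not\subset v \cup \mathcal{N}_v} \theta_j\Big)}$$

$$= \frac{\exp\Big(\sum_{j \triangleleft i,\ j \in J^{PS,v}} \theta_j\Big)}{1 + \sum_{k \in I \mid k_{V \setminus \{v\}} = i_{V \setminus \{v\}},\ k_v \neq 0} \exp\Big(\sum_{j \triangleleft k,\ j \in J^{PS,v}} \theta_j\Big)} \qquad (4.3)$$

and

$$p(X_v = 0 \mid X_{V \setminus \{v\}} = i_{V \setminus \{v\}}) = \frac{1}{1 + \sum_{k \in I \mid k_{V \setminus \{v\}} = i_{V \setminus \{v\}},\ k_v \neq 0} \exp\Big(\sum_{j \triangleleft k,\ j \in J^{PS,v}} \theta_j\Big)}. \qquad (4.4)$$

Equality (4.3) is due to the fact that the set of $j \in J$ such that $j \triangleleft k$, $S(j) \not\subset v \cup \mathcal{N}_v$, is
the same whether $k_v = i_v$ or $k_v \neq i_v$ and therefore the term $\exp\big(\sum_{j \triangleleft k,\ S(j) \not\subset v \cup \mathcal{N}_v} \theta_j\big)$ cancels
out at the numerator and the denominator. The same goes for the set of $j \in J$ such that
$j \triangleleft k$, $S(j) \subset \mathcal{N}_v$.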


Remark 4.1 In the equation above, we worked with $p(X_v \mid X_{V \setminus \{v\}})$ rather than with $p(X_v \mid X_{\mathcal{N}_v})$,
though the two are equal, in order to emphasize that the parameter

$$\theta^{PS,v} = (\theta_j, j \in J^{PS,v}), \quad v \in V \qquad (4.5)$$

of the $v$-th component of the conditional composite distribution is a subvector of $\theta$, the
parameter of the global likelihood function.
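The cancellation behind (4.3) and (4.4) can be verified by brute force on a small binary model. The sketch below assumes a hypothetical four-cycle with randomly drawn parameters and compares the conditional probability computed from the full joint weights with the one that uses only $\theta_j$, $j \in J^{PS,v}$. All function names are ours.

```python
from itertools import product
import numpy as np

def log_weight(cell, theta):
    """sum of theta_j over j <| cell; for binary data, j <| cell iff the cell
    equals 1 on every vertex of the clique S(j)."""
    return sum(val for D, val in theta.items() if all(cell[u] for u in D))

def conditional_brute_force(v, cell, theta):
    """p(X_v = cell[v] | X_{V minus v}) from the unnormalised joint weights."""
    den = 0.0
    for kv in (0, 1):
        k = list(cell)
        k[v] = kv
        den += np.exp(log_weight(k, theta))
    return np.exp(log_weight(cell, theta)) / den

def conditional_local(v, cell, theta):
    """Same probability using only theta_j with v in S(j), as in (4.3)-(4.4)."""
    theta_ps = {D: val for D, val in theta.items() if v in D}
    k1 = list(cell)
    k1[v] = 1                                   # the only cell with k_v != 0
    den = 1.0 + np.exp(log_weight(k1, theta_ps))
    return np.exp(log_weight(cell, theta_ps)) / den

# Hypothetical four-cycle 0 - 1 - 2 - 3 - 0 with random parameters.
rng = np.random.default_rng(1)
cliques = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3), (0, 3)]
theta = {D: float(rng.normal()) for D in cliques}

max_gap = max(
    abs(conditional_brute_force(v, cell, theta) - conditional_local(v, cell, theta))
    for cell in product((0, 1), repeat=4)
    for v in range(4)
)
```

The two computations agree up to rounding error, which illustrates that the local conditional likelihood at $v$ involves only $\theta^{PS,v}$.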


We now define the two-hop conditional composite likelihood function.

Definition 4.1 For a given $v \in V$, we will say that $\mathcal{M}_v$ is a one-hop neighbourhood of $v$
if it comprises $v$ and its immediate neighbours in $G$, i.e. if $\mathcal{M}_v = \{v\} \cup \mathcal{N}_v$. We will say
that $\mathcal{M}_v$ is a two-hop neighbourhood if it comprises $v$, its immediate neighbours and the
neighbours of the immediate neighbours in $G$. We use the notation

$$\mathcal{N}_{2v} = \mathcal{M}_v \setminus (\{v\} \cup \mathcal{N}_v)$$

to denote the set of neighbours of the neighbours of $v$. For simplicity of notation, we will
denote both the one-hop and two-hop neighbourhoods by $\mathcal{M}_v$.
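The two-hop conditional composite likelihood function is $L^{2PS}(\theta) = \prod_{v \in V} L^{2PS,v}(\theta)$ where

$$L^{2PS,v}(\theta) = \prod_{k=1}^{N} p(X_{\{v\} \cup \mathcal{N}_v} = i^{(k)}_{\{v\} \cup \mathcal{N}_v} \mid X_{\mathcal{N}_{2v}} = i^{(k)}_{\mathcal{N}_{2v}}). \qquad (4.6)$$

The expression of this conditional probability is the same as (4.3) and (4.4) but
with $J^{PS,v}$ replaced by $J^{2PS,v}$ where

$$J^{2PS,v} = \{j \in J \mid S(j) \subset \mathcal{M}_v,\ S(j) \not\subset \mathcal{N}_{2v}\}.$$

In a parallel way to Remark 4.1, we note that

$$\theta^{2PS,v} = (\theta_j, j \in J^{2PS,v})$$

is a subvector of $\theta = (\theta_j, j \in J)$, the argument of the global likelihood function.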

4.2 The marginal composite likelihood 

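Let $\mathcal{M}_v$ be the one-hop or two-hop neighbourhood of $v$. The marginal composite likelihood
is the product

$$L^{\mathcal{M}}(\theta) = \prod_{v \in V} \prod_{k=1}^{N} p(X_{\mathcal{M}_v} = i^{(k)}_{\mathcal{M}_v}) = \prod_{v \in V} L^{\mathcal{M}_v}(\theta) \qquad (4.7)$$

where $L^{\mathcal{M}_v}(\theta) = \prod_{k=1}^{N} p(X_{\mathcal{M}_v} = i^{(k)}_{\mathcal{M}_v})$. The $\mathcal{M}_v$-marginal model is clearly multinomial
and the corresponding data can be read in the $\mathcal{M}_v$-marginal contingency table obtained
from the full table. The density of the $\mathcal{M}_v$-marginal multinomial distribution is of the
general exponential form

$$f(t^{\mathcal{M}_v}; \theta^{\mathcal{M}_v}) = \exp\{\langle t^{\mathcal{M}_v}, \theta^{\mathcal{M}_v} \rangle - N k^{\mathcal{M}_v}(\theta^{\mathcal{M}_v})\} \qquad (4.8)$$

where $t^{\mathcal{M}_v}$, $\theta^{\mathcal{M}_v}$ and $k^{\mathcal{M}_v}$ are respectively the $\mathcal{M}_v$-marginal canonical statistic, canonical
parameter and cumulant generating function.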

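In order to identify the $\mathcal{M}_v$-marginal model, we first establish the relationship between
$\theta$ and $\theta^{\mathcal{M}_v}$. In the sequel, the symbol $j$ will be understood to be an element of $I_{\mathcal{M}_v}$ when
used in the notation $\theta_j^{\mathcal{M}_v}$ while it will be understood to be the element of $J$ obtained by
padding it with entries $j_{V \setminus \mathcal{M}_v} = 0$ when used in the notation $\theta_j$. We now give the general
relationship between the parameters of the overall model and those of the $\mathcal{M}_v$-marginal
model. Proofs are given in the Appendix.

Lemma 4.1 Let $\mathcal{M}_v$ be the one-hop or two-hop neighbourhood of $v \in V$. For $j \in J$,
$S(j) \subset \mathcal{M}_v$, the parameter $\theta_j$ of the overall model and the parameter $\theta_j^{\mathcal{M}_v}$ of the marginal
model are linked by the following:

$$\theta_j^{\mathcal{M}_v} = \theta_j + \sum_{j' \mid j' \triangleleft_0 j} (-1)^{|S(j)| - |S(j')|} \log\Big(1 + \sum_{i \in I,\ i_{\mathcal{M}_v} = j',\ i_{V \setminus \mathcal{M}_v} \neq 0} \exp \sum_{k \mid k \triangleleft i,\ k \not\triangleleft j'} \theta_k\Big), \qquad (4.9)$$

where $j' \triangleleft_0 j$ means that either $j' \triangleleft j$ or $j' = 0$.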

We now want to identify which of the marginal parameters are equal to the
corresponding overall parameter and in particular which marginal parameters are equal to 0
when the global parameter is equal to zero. Let $\mathcal{M}_v^c$ denote the complement of $\mathcal{M}_v$ in $V$.
We define the buffer set at $v$ as follows:

$$\mathcal{B}_v = \{w \in \mathcal{M}_v \mid \exists\, w' \in \mathcal{M}_v^c \text{ with } (w, w') \in E\}. \qquad (4.10)$$

We have the following result.

Lemma 4.2 Let $\mathcal{M}_v$ be the one-hop or two-hop neighbourhood of $v \in V$. For $j \in J$,
$S(j) \subset \mathcal{M}_v$, the following holds:

(1.) if $S(j) \not\subset \mathcal{B}_v$, then $\theta_j^{\mathcal{M}_v} = \theta_j$,

(2.) if $S(j) \subset \mathcal{B}_v$, then in general $\theta_j^{\mathcal{M}_v} \neq \theta_j$, and (4.9) holds.

Moreover, for $i \in I$, $S(i) \subset \mathcal{M}_v$,

(3.) if $S(i) \not\subset \mathcal{B}_v$, then $\theta_i^{\mathcal{M}_v} = 0$ whenever $\theta_i = 0$.

From the lemma above, we see that, for $j \in J$ such that $S(j) \subset \mathcal{M}_v$, $S(j) \not\subset \mathcal{B}_v$, the
corresponding global and $\mathcal{M}_v$-marginal loglinear parameters are equal. We see also that
for $i \in I$ such that $S(i) \subset \mathcal{M}_v$, $S(i) \not\subset \mathcal{B}_v$, if the loglinear parameter is zero in the global
model, it remains zero in the $\mathcal{M}_v$-marginal model.
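The neighbourhoods of Definition 4.1 and the buffer set (4.10) are easy to compute from an edge list. The sketch below assumes a hypothetical $5 \times 5$ four-neighbour lattice and its centre vertex; the helper names are ours.

```python
def adjacency(edges):
    """Adjacency sets of an undirected graph given as a list of vertex pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hop_neighbourhood(v, adj, hops):
    """M_v: v together with all vertices within `hops` edges of v."""
    M, frontier = {v}, {v}
    for _ in range(hops):
        frontier = {w for u in frontier for w in adj[u]} - M
        M |= frontier
    return M

def buffer_set(M, adj):
    """B_v: vertices of M_v having at least one neighbour outside M_v, as in (4.10)."""
    return {w for w in M if any(x not in M for x in adj[w])}

# Hypothetical 5 x 5 four-neighbour lattice, centre vertex (2, 2).
edges = [((r, c), (r, c + 1)) for r in range(5) for c in range(4)]
edges += [((r, c), (r + 1, c)) for r in range(4) for c in range(5)]
adj = adjacency(edges)
centre = (2, 2)

M1 = hop_neighbourhood(centre, adj, hops=1)
M2 = hop_neighbourhood(centre, adj, hops=2)
B1 = buffer_set(M1, adj)
B2 = buffer_set(M2, adj)
```

For the one-hop neighbourhood the buffer set consists of the four neighbours of the centre, so only parameters $\theta_j$ with the centre in $S(j)$ are preserved by marginalisation. For the two-hop neighbourhood the buffer is the outer ring of eight vertices.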

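4.3 A convex relaxation of the local marginal optimization problems

It is clear from (4.9) that even though maximizing the marginal likelihood from (4.8) is
convex in $\theta^{\mathcal{M}_v}$, it is not convex in $\theta$. We would therefore like to replace the problem of
maximizing (4.8), non-convex in $\theta$, by a convex relaxation problem. We know from (1.) of
Lemma 4.2 that $\theta_j^{\mathcal{M}_v} = \theta_j$ for $j$ in the set $\{j \in J \mid S(j) \subset \mathcal{M}_v,\ S(j) \not\subset \mathcal{B}_v\}$.
We also know from (3.) of Lemma 4.2 that if the global model parameter $\theta_i$, $S(i) \subset
\mathcal{M}_v$, $S(i) \not\subset \mathcal{B}_v$, is equal to zero, then $\theta_i^{\mathcal{M}_v}$ is also equal to zero. Following what has been
done for Gaussian graphical models, it is then natural to consider the following graphical
model relaxation of the $\mathcal{M}_v$-marginal model.

Let $\mathcal{M}_{l,v}$ index the relaxed hierarchical loglinear model obtained from the $\mathcal{M}_v$-marginal
model by keeping interactions given by edges with at least one endpoint in $\mathcal{M}_v \setminus \mathcal{B}_v$ and
only those edges, and all interactions in the power set $2^{\mathcal{B}_v}$. The index $l$ takes values $l = 1$
or $l = 2$ when $\mathcal{M}_v$ is respectively the one-hop or two-hop neighbourhood of $v$. The $J$-set
of this local model is

$$J^{\mathcal{M}_{l,v}} = \{j \in J \mid S(j) \subset \mathcal{M}_v,\ S(j) \not\subset \mathcal{B}_v\} \cup \{i \in I \mid S(i) \subset \mathcal{B}_v\}. \qquad (4.11)$$

Let $p^{\mathcal{M}_{l,v}}(i_{\mathcal{M}_v})$ denote the marginal probability of $i_{\mathcal{M}_v}$ in the $\mathcal{M}_{l,v}$-marginal model.
The local estimates of $\theta_j$, $j \in \{j \in J \mid S(j) \subset \mathcal{M}_v,\ S(j) \not\subset \mathcal{B}_v\}$, are obtained by maximizing
the $\mathcal{M}_{l,v}$-marginal likelihood

$$L^{\mathcal{M}_{l,v}}(\theta^{\mathcal{M}_{l,v}}) = \prod_{k=1}^{N} p^{\mathcal{M}_{l,v}}(X_{\mathcal{M}_v} = i^{(k)}_{\mathcal{M}_v}) = \exp\{\langle \theta^{\mathcal{M}_{l,v}}, t^{\mathcal{M}_{l,v}} \rangle - N k^{\mathcal{M}_{l,v}}(\theta^{\mathcal{M}_{l,v}})\} \qquad (4.12)$$

which is a convex maximization problem in

$$\theta^{\mathcal{M}_{l,v}} = (\theta_j, j \in J^{\mathcal{M}_{l,v}}).$$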

At this point, we need to make two important remarks.

Remark 4.2 The vector $\theta^{PS,v}$ defined in (4.5) is a subvector of $\theta^{\mathcal{M}_{l,v}}$. Therefore
maximizing (4.12) for either $l = 1$ or $l = 2$ will yield an estimate of $\theta^{PS,v}$.

Remark 4.3 The $\mathcal{M}_{l,v}$, $l = 1, 2$, marginal model is a hierarchical loglinear model but
not necessarily a graphical model. For example, if we consider a four-neighbour lattice
and a given vertex $v_0$ and its four neighbours that we will call 1, 2, 3, 4 for now, then the
generating set of the relaxed $\mathcal{M}_{1,v_0}$-marginal model is

$$\mathcal{D}^{\mathcal{M}_{1,v_0}} = \{(v_0, 1), (v_0, 2), (v_0, 3), (v_0, 4), (1, 2, 3, 4)\}.$$

This is not a discrete graphical model since a graphical model would also include the
interactions $(v_0, 1, 2)$, $(v_0, 2, 3)$, $(v_0, 3, 4)$, $(v_0, 1, 4)$, $(v_0, 1, 2, 3, 4)$. It was therefore crucial to
set up our problem, as we did in Section 2, within the framework of hierarchical loglinear
models rather than the more restrictive class of discrete graphical models.
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4.4 The composite maximum likelihood estimates 

To obtain the composite mle of the global loglinear parameter $\theta$, we do the following.
First, for each $v \in V$, we compute the maximum likelihood estimate from the local
conditional likelihood as in (4.2) (in the conditional composite likelihood case) or from the
local marginal likelihood in (4.12) (in the marginal composite likelihood case). Second, in
each case, we retain only the estimate of $\theta^{v,PS}$.
Third, the global maximal composite likelihood estimate of each $\theta_j$, $j \in J$, is derived
either by simple averaging of the different estimates of $\theta_j$ obtained
from the various local models or by more sophisticated ways such as those described in Liu and
Ihler (2012). These include linear consensus, maximum consensus or ADMM. If one uses
the two-hop neighbourhood, simple averaging is usually
sufficient to obtain very good accuracy.
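
The simple-averaging step can be sketched as follows. This is only an illustration under stated assumptions: the function name `average_consensus`, the dictionary layout and the toy numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def average_consensus(local_estimates):
    """Combine local estimates of theta_j by simple averaging.

    local_estimates maps a vertex v to a dict {j: estimate of theta_j},
    holding only the components retained from the local model at v.
    Returns a dict {j: average over all local models that estimate theta_j}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for estimates in local_estimates.values():
        for j, value in estimates.items():
            totals[j] += value
            counts[j] += 1
    return {j: totals[j] / counts[j] for j in totals}

# Hypothetical toy input: the edge parameter (1, 2) is estimated at both endpoints.
local = {
    1: {(1,): 0.20, (1, 2): -0.50},
    2: {(2,): 0.10, (1, 2): -0.30},
}
global_estimate = average_consensus(local)
```

A main effect is estimated in a single local model and is therefore returned unchanged, whereas an edge parameter is averaged over the local models of its two endpoints.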

It follows immediately that if we can prove that the mle of $\theta^{v,PS}$ obtained from the local
conditional likelihood and from the local marginal likelihood are identical, then the maximal
composite likelihood estimates of $\theta$ obtained
from the conditional or the marginal composite likelihood by consensus will be the same.
In the next subsection, we show that this is indeed the case.

4.5 Equality of the maximal conditional and marginal composite 
likelihood estimate 

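Let $\hat\theta^{\mathcal{M}_{l,v}}$, $l = 1, 2$, denote the maximum likelihood estimate of $\theta^{\mathcal{M}_{l,v}}$ obtained from the
local likelihood (4.12).

Theorem 4.1 The PS component of $\hat\theta^{\mathcal{M}_{1,v}}$, i.e. $(\hat\theta_j^{\mathcal{M}_{1,v}},\ j \in J^{v,PS})$, is equal to the maximum
likelihood estimate of $\theta^{v,PS}$ obtained from the local conditional likelihood (4.2).

Similarly, the 2PS component of $\hat\theta^{\mathcal{M}_{2,v}}$, i.e. $(\hat\theta_j^{\mathcal{M}_{2,v}},\ j \in J^{v,2PS})$, is equal to the maximum
likelihood estimate of $\theta^{v,2PS}$ obtained from the local conditional likelihood (4.6).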

The proof is given in the Appendix. 

At this point, we ought to make an important observation. In the case of the two-hop
marginal likelihood, it may happen that the buffer $\mathcal{B}_v$ is no longer equal to $\mathcal{N}_{2v}$.
For example, if we consider a four-neighbour 5 x 10 lattice, the vertex 39 is such that
$\mathcal{N}_{2,39} = \{19, 28, 30, 37, 48, 50\}$ while $\mathcal{B}_{39} = \mathcal{N}_{2,39} \setminus \{50\}$. The argument in the proof of Theorem
4.1 for $j$ such that $S(j) \not\subset \mathcal{N}_{2v}$ then breaks down since, in the $\mathcal{M}_{2,v}$-marginal model,
some cells such as $i = (i_{30} = 1, i_{50} = 1, 0_{\mathcal{N}_{2v} \setminus \{30, 50\}})$ with support in $\mathcal{N}_{2v}$ no longer
have a complete support. This situation is illustrated in Figure 1 where, for the sake of
comparison, we also draw a vertex for which $\mathcal{N}_{2v} = \mathcal{B}_v$ and Theorem 4.1 applies.
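
The lattice example can be checked with a short sketch. It assumes that the lattice is numbered row by row starting at 1, that the two-hop neighbourhood of $v$ consists of $v$, its neighbours and their neighbours, and that the buffer is the set of vertices of that neighbourhood having at least one neighbour outside it. The formal definition of the buffer is given earlier in the paper, so this reading, like the function names, is an assumption of the sketch.

```python
def lattice_neighbours(rows, cols):
    """Adjacency of a four-neighbour lattice, vertices numbered 1..rows*cols row by row."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c + 1
            nbrs = set()
            if r > 0:
                nbrs.add(v - cols)
            if r < rows - 1:
                nbrs.add(v + cols)
            if c > 0:
                nbrs.add(v - 1)
            if c < cols - 1:
                nbrs.add(v + 1)
            adj[v] = nbrs
    return adj

def two_hop_sets(adj, v):
    """Return (one-hop neighbours, two-hop neighbours, neighbourhood, buffer) of v."""
    n1 = set(adj[v])
    n2 = set().union(*(adj[w] for w in n1)) - n1 - {v}
    m = {v} | n1 | n2
    # Assumed buffer: vertices of the neighbourhood with a neighbour outside it.
    buffer = {w for w in m if adj[w] - m}
    return n1, n2, m, buffer

adj = lattice_neighbours(5, 10)
n1_39, n2_39, m_39, buffer_39 = two_hop_sets(adj, 39)
n1_25, n2_25, m_25, buffer_25 = two_hop_sets(adj, 25)
```

Under these assumptions, vertex 50 is excluded from the buffer of vertex 39 because its only neighbours, 40 and 49, both lie inside the two-hop neighbourhood, while for vertex 25 the buffer coincides with the set of two-hop neighbours.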

Figure 1: Two vertices in a 5 x 10 lattice: Theorem 4.1 applies for vertex 25 while it does
not apply for vertex 39.

In Tables 1 and 2, we give the numerical values of the maximum likelihood estimates
$\hat\theta_j$ obtained from the four local models PS, PS2, $\mathcal{M}_{1,v}$ and $\mathcal{M}_{2,v}$, for the parameters
involving vertex 25 and for those involving vertex 39 respectively. We see that in the first case, the
values of $\hat\theta_j$ obtained from the local likelihoods $l^{\mathcal{M}_{1,25}}$ and $l^{25,PS}$ are identical, and similarly
for those obtained from $l^{25,2PS}$ and $l^{\mathcal{M}_{2,25}}$, while in the second case, the values obtained
through the PS2 and $\mathcal{M}_{2,v}$ models are slightly different. The values obtained from the
PS and $\mathcal{M}_{1,v}$ models are identical since then $\mathcal{B}_v = \mathcal{N}_v$ and the proof of Theorem 4.1 does
not break down.


Models                 $\theta_{25}$   $\theta_{15,25}$   $\theta_{24,25}$   $\theta_{25,26}$   $\theta_{25,35}$
$\mathcal{M}_{1,v}$    -0.0536         0.5914             -0.4808            -0.8314            -0.8461
$\mathcal{M}_{2,v}$    -0.0779         0.5221             -0.5310            -0.7274            -0.7459
(v,PS)                 -0.0536         0.5914             -0.4808            -0.8314            -0.8461
(v,2PS)                -0.0779         0.5221             -0.5310            -0.7274            -0.7459

Table 1: The composite mle of some $\theta_j$ for vertex 25 in the 5 x 10 lattice


Models                 $\theta_{39}$   $\theta_{29,39}$   $\theta_{38,39}$   $\theta_{39,40}$   $\theta_{39,49}$
$\mathcal{M}_{1,v}$    -1.0799         -0.3306            -0.3647            -0.5791            1.1749
$\mathcal{M}_{2,v}$    -1.0386         -0.3519            -0.5020            -0.5445            1.1946
(v,PS)                 -1.0799         -0.3306            -0.3647            -0.5791            1.1749
(v,2PS)                -1.0381         -0.3531            -0.5019            -0.5448            1.1947

Table 2: The composite mle of some $\theta_j$ for vertex 39 in the 5 x 10 lattice

Remark 4.4 The equality of the estimates holds also for the marginal estimates obtained
by Mizrahi et al. (2014b) if, for $q$ a clique of $G$ and $v \in q \subset A_q$, satisfying the strong LAP
condition with respect to $A_q$, we retain only the parameters $\theta_j$, $j \in J^{v,PS}$. We also note
that Theorem 9 in that paper may not be verified in some cases. For example, take vertex
7 in a 3 x 3 lattice numbered from left to right starting with the top row, and take $q = \{7, 8\}$
as the clique of interest. Then $A_q = \{4, 7, 8\}$ satisfies the strong LAP condition but $\theta_8$ in
the $A_q$-marginal model cannot be equal to $\theta_8$ in the joint model, as our Lemma 4.2 shows.

4.6 Existence of the mle and the composite likelihood estimates 

To finish this section, we describe a numerical experiment illustrating the impact of the
non-existence of the global mle on the computation of the various composite maximum
likelihood estimates.

For each sample size $N = 40$, 60 and 80, we consider two experiments, one where the
data point $t = (t_j, j \in J) = (n(j_{S(j)}), j \in J)$ belongs to a face of the global model and
one where it does not. For each experiment, we compute five estimates: the global mle
and the four composite likelihood estimates based on $\hat\theta^{\mathcal{M}_{1,v}}$, $\hat\theta^{\mathcal{M}_{2,v}}$, $\hat\theta^{v,PS}$ and $\hat\theta^{v,2PS}$, $v \in V$. For
each one of the five estimates and for each experiment, we report the relative mean square
error (MSE)

$$\frac{\|\hat\theta - \theta^*\|_F^2}{\|\theta^*\|_F^2}$$

where $\theta^*$ denotes the true value of the parameter. The results are given in Table 3.
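
As a small illustration of the error measure, the relative MSE of an estimate stored as a flat list of coordinates could be computed as below. The function name and the list representation are assumptions of this sketch, not part of the paper.

```python
def relative_mse(theta_hat, theta_star):
    """Relative squared error ||theta_hat - theta_star||^2 / ||theta_star||^2.

    Both arguments are sequences of floats listing the parameters in the same order.
    """
    num = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_star))
    den = sum(b ** 2 for b in theta_star)
    return num / den
```

A value of 1 corresponds to an estimate that is as far from $\theta^*$ as the zero vector, which gives a scale for reading the values reported for the experiments on a face.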


Sample size    40          40              60          60              80          80
               (on face)   (not on face)   (on face)   (not on face)   (on face)   (not on face)
Global MLE     102.72      3.1270          39.68       2.2620          27.12       1.5717
M1 MLE         134.94      3.6335          66.00       2.3214          43.46       1.6454
PS1 MLE        127.64      3.6335          32.63       2.3214          24.34       1.6454
M2 MLE         340.81      3.1328          65.01       2.2700          42.67       1.5728
PS2 MLE        84.52       3.1320          44.55       2.2709          29.76       1.5727

Table 3: The relative MSE of different estimates of $\theta$ for the 5 x 10 lattice when the mle
exists and when it does not.


We note that when the global mle does not exist, the mean square error for all
experiments is much larger than when the mle exists, indicating that some of the local
mle do not exist. When a local mle does not exist, a routine maximization of the
local likelihood may lead to erroneous results. We have thus illustrated that the numerical
results from each local maximization must be examined carefully to detect any potential
existence problem. We note also that the data vector may be on a face of a global model
without being on the face of any local model. In this case, clearly, the composite mle of
$\theta$ is not a consistent estimate of $\theta_0$.

5 Asymptotic properties of the maximum composite 
likelihood estimate 

The asymptotic properties of the maximum composite likelihood estimate when $p$ is fixed
and $N$ goes to $+\infty$ are well-known (see Jordan and Liang (2009)). In this section, we
consider the asymptotic properties of the conditional composite mle (which is also the
marginal composite mle) when both $p$ and $N$ go to $+\infty$. In Theorem 5.1 below, we give
its rate of convergence to the true value $\theta^*$. In order to compare the behaviour of the
composite mle with the global mle, we also give, in Theorem 5.2, the rate of convergence
of the global mle under the same asymptotic regime. It will be convenient to introduce
the notation





(5.1) 


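In this section, we work exclusively with the local conditional likelihood $l^{PS_v}$. Therefore, for simplicity of notation,
we write $\theta$ for $\theta^{PS_v}$. Also, for convenience, we scale the log likelihood by the factor $\frac{1}{N}$.
Then the $v$-local conditional log-likelihood function is

$$l^{PS_v}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \frac{\exp\{\sum_{j \in J^{PS_v}} \theta_j f_j(x_v^{(n)}, x_{\mathcal{N}_v}^{(n)})\}}{Z^{n,v}(\theta)}.$$

The sufficient statistic is $t_j = \frac{1}{N} \sum_{n=1}^{N} f_j(x_v^{(n)}, x_{\mathcal{N}_v}^{(n)})$, $j \in J^{PS_v}$. We write

$$t_{J^{PS_v}} = (t_1, t_2, \ldots, t_{d_v})^t \qquad (5.2)$$

and

$$k^{PS_v}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log\Big(1 + \sum_{y_v \in I_v \setminus \{0\}} \exp\{\sum_{j \in J^{PS_v}} \theta_j f_j(y_v, x_{\mathcal{N}_v}^{(n)})\}\Big) = \frac{1}{N} \sum_{n=1}^{N} \log Z^{n,v}(\theta),$$

where

$$Z^{n,v}(\theta) = 1 + \sum_{y_v \in I_v \setminus \{0\}} \exp\{\sum_{j \in J^{PS_v}} \theta_j f_j(y_v, x_{\mathcal{N}_v}^{(n)})\}.$$

Then the log-likelihood function is

$$l^{PS_v}(\theta) = \sum_{j \in J^{PS_v}} \theta_j t_j - k^{PS_v}(\theta).$$

Its first derivative is

$$\frac{\partial l^{PS_v}(\theta)}{\partial \theta_k} = t_k - \frac{\partial k^{PS_v}(\theta)}{\partial \theta_k}, \qquad
\frac{\partial k^{PS_v}(\theta)}{\partial \theta_k} = \frac{1}{N} \sum_{n=1}^{N} \frac{\exp\{\sum_{j \in J^{PS_v}} \theta_j f_j(k_v, x_{\mathcal{N}_v}^{(n)})\}}{Z^{n,v}(\theta)} f_k(k_v, x_{\mathcal{N}_v}^{(n)})$$

with

$$\frac{\exp\{\sum_{j \in J^{PS_v}} \theta_j f_j(k_v, x_{\mathcal{N}_v}^{(n)})\}}{Z^{n,v}(\theta)} = p(X_v = k_v \mid x_{\mathcal{N}_v}^{(n)}). \qquad (5.3)$$

We now want to compute $\frac{\partial^2 l^{PS_v}(\theta)}{\partial \theta_k \partial \theta_l} = -\frac{\partial^2 k^{PS_v}(\theta)}{\partial \theta_k \partial \theta_l}$, $k, l \in J^{PS_v}$. To simplify our notation further,
we set

$$z_{y_v}(\theta) = \sum_{j \in J^{PS_v}} \theta_j f_j(y_v, x_{\mathcal{N}_v}^{(n)}). \qquad (5.4)$$

For $k_v = l_v$, using (5.3), we obtain

$$\frac{\partial^2 k^{PS_v}(\theta)}{\partial \theta_k \partial \theta_l}
= \frac{1}{N} \sum_{n=1}^{N} \Big(\frac{\exp z_{k_v}(\theta)}{Z^{n,v}(\theta)} - \Big(\frac{\exp z_{k_v}(\theta)}{Z^{n,v}(\theta)}\Big)^2\Big) f_k(k_v, x_{\mathcal{N}_v}^{(n)}) f_l(l_v, x_{\mathcal{N}_v}^{(n)})$$

$$= \frac{1}{N} \sum_{n=1}^{N} \big(p(X_v = k_v \mid x_{\mathcal{N}_v}^{(n)}) - p(X_v = k_v \mid x_{\mathcal{N}_v}^{(n)})^2\big) f_k(k_v, x_{\mathcal{N}_v}^{(n)}) f_l(l_v, x_{\mathcal{N}_v}^{(n)}).$$

If $k_v \neq l_v$, then

$$\frac{\partial^2 k^{PS_v}(\theta)}{\partial \theta_k \partial \theta_l}
= -\frac{1}{N} \sum_{n=1}^{N} \frac{\exp z_{k_v}(\theta) \exp z_{l_v}(\theta)}{(Z^{n,v}(\theta))^2} f_k(k_v, x_{\mathcal{N}_v}^{(n)}) f_l(l_v, x_{\mathcal{N}_v}^{(n)})$$

$$= \frac{1}{N} \sum_{n=1}^{N} \big(-p(X_v = k_v \mid x_{\mathcal{N}_v}^{(n)})\, p(X_v = l_v \mid x_{\mathcal{N}_v}^{(n)})\big) f_k(k_v, x_{\mathcal{N}_v}^{(n)}) f_l(l_v, x_{\mathcal{N}_v}^{(n)}).$$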
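
To make the structure $l(\theta) = \theta^t t - k(\theta)$ concrete, here is a minimal sketch for the special case of a binary vertex $v$ with main effect $\theta_v$ and one two-way interaction per neighbour (an Ising-type local model). In that case the indicator vector at $y_v = 1$ is $(1, x_{w_1}, \ldots, x_{w_m})$ and the local conditional model reduces to a logistic regression. The binary restriction, the function name and the data layout are assumptions made for illustration only.

```python
import math

def local_conditional_loglik(theta, xv, xnb):
    """Scaled local conditional log-likelihood theta^t t - k(theta) and its gradient.

    theta: [theta_v, theta_{v,w_1}, ..., theta_{v,w_m}]
    xv:    list of N observed 0/1 values of X_v
    xnb:   list of N tuples with the observed 0/1 values of the m neighbours
    """
    n_obs = len(xv)
    d = len(theta)
    t = [0.0] * d    # sufficient statistic t_j
    k = 0.0          # k(theta) = (1/N) sum_n log Z^{n,v}(theta)
    dk = [0.0] * d   # gradient of k(theta)
    for y, nb in zip(xv, xnb):
        w = [1.0] + [float(x) for x in nb]          # indicators f_j(1, x_N)
        z = sum(th * wj for th, wj in zip(theta, w))
        prob = 1.0 / (1.0 + math.exp(-z))           # p(X_v = 1 | x_N) = exp(z) / Z
        k += math.log1p(math.exp(z)) / n_obs        # log Z = log(1 + exp(z))
        for j in range(d):
            t[j] += y * w[j] / n_obs
            dk[j] += prob * w[j] / n_obs
    loglik = sum(th * tj for th, tj in zip(theta, t)) - k
    grad = [tj - dkj for tj, dkj in zip(t, dk)]     # t_k - dk/dtheta_k
    return loglik, grad
```

The returned gradient has the form $t_k - \partial k / \partial \theta_k$ of the first derivative, with the conditional probability (5.3) appearing as `prob`.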


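Let $W^n = (f_j(j_v, x_{\mathcal{N}_v}^{(n)}),\ j \in J^{PS_v})$ be the $d_v \times 1$ vector of indicators. We introduce the
notation

$$\eta_{k,l}(\theta, x_{\mathcal{N}_v}^{(n)}) =
\begin{cases}
\dfrac{\exp z_{k_v}(\theta)}{Z^{n,v}(\theta)} - \Big(\dfrac{\exp z_{k_v}(\theta)}{Z^{n,v}(\theta)}\Big)^2 & \text{if } k_v = l_v \\[2mm]
-\dfrac{\exp z_{k_v}(\theta) \exp z_{l_v}(\theta)}{(Z^{n,v}(\theta))^2} & \text{if } k_v \neq l_v
\end{cases} \qquad (5.5)$$

Let $H(\theta, x_{\mathcal{N}_v}^{(n)})$ be the $d_v \times d_v$ matrix with $(k, l)$ entry $\eta_{k,l}(\theta, x_{\mathcal{N}_v}^{(n)})$. Then the Fisher
information matrix derived from $l^{PS_v}$ is

$$(k^{PS_v})''(\theta) = \frac{1}{N} \sum_{n=1}^{N} H(\theta, x_{\mathcal{N}_v}^{(n)}) \circ [W^n (W^n)^t]$$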
where $\circ$ denotes the Hadamard product of two matrices. We make two assumptions on
the behaviour of the cumulant generating function $k^{PS_v}$, $v \in V$, at $\theta^*$, similar to those
made by Ravikumar et al. (2010) and Meng (2014):

(A) For the design matrix of the $v$-local conditional models, we assume that there exists
$D_{\max} > 0$ such that

$$\max_{v \in V} \lambda_{\max}\Big(\frac{1}{N} \sum_{n=1}^{N} W^n (W^n)^t\Big) \le D_{\max};$$

(B) We assume the minimum eigenvalue of the Fisher information matrices $(k^{PS_v})''(\theta^*)$, $v \in V$,
is bounded, i.e., there exists $C_{\min} > 0$ such that

$$C_{\min} = \min_{v \in V} \lambda_{\min}\Big(\frac{1}{N} \sum_{n=1}^{N} H(\theta^*, x_{\mathcal{N}_v}^{(n)}) \circ [W^n (W^n)^t]\Big).$$

We are now ready to state our theorem on the asymptotic behaviour of the composite
mle $\hat\theta$ obtained by averaging the mle of $\theta^{PS_v}$ obtained from the various $v$-neighbourhoods
as indicated in Section 4.4.

Theorem 5.1 Assume conditions (A) and (B) hold. If the sample size $N$ and $|V| = p$
satisfy

$$\frac{N}{\log p} \ge \max_{v \in V} \Big(\frac{10\, C\, D_{\max}\, d_v}{C_{\min}^2}\Big)^2,$$

where $C$ is a positive constant such that $p^{C^2/2} > 2 \sum_{v \in V} d_v$, then the conditional composite
mle $\hat\theta = (\hat\theta_j, j \in J)$ is such that

$$\|\hat\theta - \theta^*\|_F \le \frac{5C}{C_{\min}} \sqrt{\frac{\sum_{v \in V} d_v \log p}{N}} \qquad (5.6)$$

with probability greater than $1 - \dfrac{2 \sum_{v \in V} d_v}{p^{C^2/2}}$.

The proof is given in the Appendix. With a similar argument, we can derive the behaviour
of the global mle, which we will denote by $\hat\theta^G$. We need to make assumptions similar to
(A) and (B). We assume that


(A') there exists D^ax > 0 such that ® /i) < Dmax, 

{B') t)< K* = x^Jh!\e*) . 


i&I 


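The asymptotic behaviour of $\hat\theta^G$ is given in the following theorem.

Theorem 5.2 Assume conditions (A') and (B') hold. If $N$ and $p$ satisfy the condition

$$\frac{N}{\log p} \ge \Big(\frac{10\, C\, |J|\, D_{\max}}{K^{*2}}\Big)^2,$$

where $C$ is a positive constant such that $p^{C^2/2} > 2|J|$, then the global mle $\hat\theta^G = (\hat\theta_j^G, j \in J)$
is such that

$$\|\hat\theta^G - \theta^*\|_F \le \frac{5C}{K^*} \sqrt{\frac{|J| \log p}{N}} \qquad (5.7)$$

with probability greater than $1 - \dfrac{2|J|}{p^{C^2/2}}$.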

The proof is provided in the Supplementary file. Comparing Theorems 5.1 and 5.2, we
see that for $\frac{N}{\log p} = O(|J|^2)$, $\|\hat\theta^G - \theta^*\|_F = O\big(\sqrt{|J| \log p / N}\big)$ with high probability, while for
$\frac{N}{\log p} = O(\max_{v \in V}(d_v^2))$, $\|\hat\theta - \theta^*\|_F = O\big(\sqrt{\sum_{v \in V} d_v \log p / N}\big)$. This implies that for the composite
mle, the requirements on the sample size $N$ are not as stringent as for the global mle but,
of course, we lose some accuracy in the approximation of $\theta^*$. The situation is, however,
not bad since

$$\sqrt{\frac{\sum_{v \in V} d_v \log p}{N}} \Big/ \sqrt{\frac{|J| \log p}{N}} = \sqrt{\frac{\sum_{v \in V} d_v}{|J|}},$$

which is the square root of the ratio of the sum over $v \in V$ of the number of parameters
in the $v$-local conditional models and the number of parameters in the global model. If
the number of neighbours for each vertex is bounded by $d$, we see that this ratio is
bounded by a quantity depending only on $d$ and is usually much smaller than that bound. For example, in an Ising model,
$|J| = p + |E|$ and $\sum_{v \in V} d_v = p + 2|E|$ and therefore

$$\frac{\sum_{v \in V} d_v}{|J|} = 1 + \frac{|E|}{p + |E|} < 2.$$
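
The Ising count can be verified numerically on the 5 x 10 four-neighbour lattice used in Section 4. The sketch below only counts parameters: one main effect per vertex and one interaction per edge in the global model, and one main effect plus one interaction per incident edge in each local conditional model. The function name is hypothetical.

```python
def ising_parameter_counts(rows, cols):
    """Parameter counts for an Ising model on a rows x cols four-neighbour lattice."""
    p = rows * cols
    n_edges = rows * (cols - 1) + (rows - 1) * cols
    size_j = p + n_edges        # |J| = p + |E|
    sum_dv = p + 2 * n_edges    # each edge is counted once at each endpoint
    return p, n_edges, size_j, sum_dv

p, n_edges, size_j, sum_dv = ising_parameter_counts(5, 10)
ratio = sum_dv / size_j         # equals 1 + |E| / (p + |E|)
```

For this lattice the ratio is 220/135, which is below 2 as stated, so the square root entering the comparison of the two rates is below $\sqrt{2}$.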
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6 Conclusion 


In this paper, we have taken a local approach to study the existence of the maximum
likelihood estimate of the canonical parameter $\theta$ in a high-dimensional discrete graphical
model. We have shown that we can use smaller graphical models to detect the nonexistence
of this mle. We have also taken a local approach to the estimation of $\theta$ by looking
at various possible versions of the composite likelihood estimate, based on local conditional
or marginal likelihoods, and we have shown that the two approaches yield the same
estimate of $\theta$. Through a numerical experiment, we have illustrated how we can be led to
an incorrect maximum composite likelihood estimate of $\theta$ if the global estimate does not
exist. Finally, we have described the asymptotic behaviour of the maximum composite
likelihood estimate of $\theta$ when both the dimension $p$ of the model and the sample size $N$
tend to infinity. We have shown that when the number of neighbours of each $v \in V$ is
bounded, the rate of convergence of the composite mle is comparable to that of the global
mle.
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7 Appendix 


7.1 Proof of Lemma 4.1 

We will use the notation $j \triangleleft_0 j'$ to mean that $j \triangleleft j'$ or $j = 0$, the zero cell. Let $p^{\mathcal{M}_v}(i)$
denote the marginal probability of $i \in I_{\mathcal{M}_v}$. We know that the $\mathcal{M}_v$-marginal distribution
of $X_{\mathcal{M}_v}$ is multinomial. By the general parametrization of the multinomial model (2.2),
for $j \in J$, $S(j) \subset \mathcal{M}_v$, since $S(j)$ is complete,






E (-1) 

j'e J, j'oj 

where by abuse of notation, j such that S{j) C is considered as an element of /_m 
M oreover, 


(7.1) 


iel: iMv=j j' I I 

j'iij 

exp E E E 


j' I j'<oj 


iVj 


can write 


Therefore logp^’'(j) = Of + log (1 + =j exp E/ I ^f) > which we 

(7.2) 


V = logp''^"(7) - log P + ^ exp ^ . 

I Ii'<oi 


i&X,iM„=j k\k<i 

k^j 
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Moebius inversion formula states that for a C 1/ an equality of the form Xlbca 
is equivalent to *h(a) = Here, using a generalization of the Moebius 

inversion formula to the partially ordered set given by < on J, we derive from (7.2) that 
for i G C J 

6j = (—i)l'S’0')-‘S'(j')l 

j' I j'<oj 

- ^ ^ exp ^ 

j'\j'<oj k\k<ii 

k^j' 

exp ^/c ^ (^ * ^) 

k^j' 

which we prefer to write as (4.9). 

7.2 Proof of Lemma 4.2 

Since (4.9) is already proved, (2.) holds. Let us prove that (1.) holds, i.e., that when
$S(j) \not\subset \mathcal{B}_v$, the alternating sum on the right-hand side of (4.9) is equal to 0. Since $j \in J$,
$S(j)$ is necessarily complete and $j' \triangleleft j$ is obtained by removing one or more vertices from
$S(j)$.

If S'(j) n 7^ 0 but S{j) ^ Bv, there is at least one vertex w G S{j) which is not in 
By. Let /o and ly, be the log terms in the alternating sum corresponding to j' = 0 and 
jlj <j such that 5'(j(„) = {w} respectively. Since for any neighbours m of tc in Aiy and for 
any i E I such that = j', the M-th coordinate iy must be zero and since w cannot have 
a neighbour outside Aiy, the set {6k, k < k /ij'} in Iq for such that = 0 is the 
same as the set {6k, k < k /ij'} in lyy for such that = j'y, and = ‘^v\Mv' 

The terms in Iq and ly, in (4.9) are therefore exactly the same except for their sign and 
these two terms cancel out. Similarly, for any given j' <j with w ^ S{j'), let j[y G J be 
such that S{j!yj) = S{j) U {w} and j[y <j, then, the set 6k, k <\ k /ij' in Iji and the set 
6k,k < i^‘^\k in are identical where, similarly to the argument above, is such 
that = j' and z^^^ is such that z^^^ = and = 'iy\M„- Therefore the terms /y 

and Iji^ cancel out and (1.) is proved. 
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To prove that (3.) holds, following (2.2), we have, for S{i) = E C A4y 

FCE 

= log (p(*F,0y\F) + ^ PiiF,OMv\F,kL,Ov\{MvUL))] 

FCE LcV\Mv kLCiL 

= 5](-l)l^\^llog(exp( e,)+ J2 E"^P( E E 

FcE jCJ,j<liF LcV\E k^dh d'^i'^F^k^) 

= ^(-l)l's\fllog(exp( «,)) (7.4) 

FcE jCJ,j<liF 

+ ^(_l)IAF|iog(i+ ^ ^exp( ^ 9,)) 

FcE LcV\F kpClL ji(iF,3^{iF<kF) 

= 9.+ 5^(-l)l'='-''llog(l+ 5^ 5^exp( ^ 9,)) (7,5) 

FcE LcV\F kpClp j^iFd<^{iF,kL) 

Now, following an argument similar to that of (1.) above, we can show that the second 
component of the sum in (7.5) is equal to zero. It follows that when = 0 then Of^'" = 0. 
This completes the proof of Lemma 4.2. 


7.3 Proof of Theorem 4.1 

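The local relaxed marginal loglikelihood is

$$l^{\mathcal{M}_{l,v}}(\theta^{\mathcal{M}_{l,v}}) = \sum_{k=1}^{N} \log p^{\mathcal{M}_{l,v}}(X_{\mathcal{M}_v} = i_{\mathcal{M}_v}^{(k)}) = \sum_{i_{\mathcal{M}_v} \in I_{\mathcal{M}_v}} n(i_{\mathcal{M}_v}) \log p^{\mathcal{M}_{l,v}}(i_{\mathcal{M}_v}).$$

It is immediate to see that

$$\frac{\partial l^{\mathcal{M}_{l,v}}(\theta^{\mathcal{M}_{l,v}})}{\partial \theta_j} = t(j) - N p^{\mathcal{M}_{l,v}}(j_{S(j)}),$$

where $p^{\mathcal{M}_{l,v}}(j_{S(j)})$ denotes
the $j_{S(j)}$-marginal cell probability in the $\mathcal{M}_{l,v}$-marginal model. Therefore the likelihood
equations $\frac{\partial l^{\mathcal{M}_{l,v}}(\theta^{\mathcal{M}_{l,v}})}{\partial \theta_j} = 0$, $j \in J^{\mathcal{M}_{l,v}}$, yield

$$t(j) - N p^{\mathcal{M}_{l,v}}(j_{S(j)}) = 0, \qquad (7.6)$$

where $t(j) = n(j_{S(j)})$.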



The argument to follow is essentially the same for the one-hop or two-hop
neighbourhood. We present it for the more general case of the two-hop neighbourhood. The
local conditional log likelihood is


iv,2PSfov,2PS\ 

‘ log -—-^-r- 

^ = W 2 J 


where 




n{iMj log 




i = W 2 J 

= ^ n(iMj log(X m, =iArJ - ^ n(W2j logp-^^’^^W. = W 2 J 

iMv&^Mv *W 2 „e/AA 2 „ 

= log 

'^J^2v ^uUA/V; ^^v\JJ\fv 

^IM2,.^qM2,.)_q ( 77 ) 


Q = Y ^(w2Jiog Y oxp[eo+ Y 

'^■^2v ^^■^2v ^vUM-u £-^uUA/\j 


(7.8) 


'=<(^t,UA7„.W2„) 

fcg j^2,u 


and 6*0 = — log(X]j^ gj^ exp J2k<iiM ^k)- The second equality above is due to the 

fact that in the expression (4.3) of l^l^o 6j such that S{j) ^ A4y 

and the 0j such that S(j) C A/ 2 ,, cancel out at the numerator and denominator and it 
therefore does not matter, for the conditional distribution of X„ua 4 given X_V 2 „, what the 
relationship between the neighbours are. The only thing that matters is the relationship 
between the vertices in v UAfy and the vertices in AT„ and according to Lemma 4.2, that 
remains unchanged when we change from the global model to the Ad 2 ,„-marginal models. 

We will now differentiate the expression of in (7.8) with respect to 9j^j G . 
We hrst note that 


99o a ^2,„ 


Usu))- 


If we use the notation 


1 ^ f 1 if J< (a^7>u^„,W2J 

i<i(a;„ujv„,*jv 2 „) I Q otherwise 
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and the notation p^‘^’'’{iE), E C to denote the marginal probability of in 

the Xl 2 ,i;-iiiarginal model, we have 


dQ , Sa:„uA/'„6/„uW„) W2,«) f ,W2„) (js{j)) 

*-V2„ fc-'A/2„ 

If j e 7 -^ 2 ,. is such that S{j) C then lj<x„uw„,W 2 j = 

dQ ^ , /•^'"(W2j(liW2.„<W2„ -p-^'’”05(i))) 


'^J^2v ^^■^2v 


= ri{js(j)) - Np^^’\js(j)) 

At the mle of the local Aii^v model, from standard likelihood equations (see Lauritzen, 
1996, Theorem 4.11), we have p^'^’'’{js{j)) = and therefore 

= 0, j€ J""-, S(j) C Vj.. (7.9) 


If j G 7 -^ 2 ,« is such that S{j) (/L A 2 d, i-e. if j G J 

0/0 / A . .. 7., 'll 


(is(i)n(7;UAr„), W2jliv2„<X2„ - (WaJ 

P^^'^{^^f2.) 


«(W 2 .' 

*.V2„ 6-^A72„ 




'^J^2v ^^J^2v 


^■^2v ^^J^2v 


P Us{j)n{vUj\fv)-: *A/'2„)liA72„<IW2„ 


Since in the A42,,;-marginal model, all the vertices in A/ 2 ,« are connected by construction, 
at the mle of the local Ai 2 ,v model, {ijV 2 v) = ^ and therefore 

AT A 4 ^'^ / ■ \ AT \ ^ / ■ * \ -i 

~ —Xp {jS(j)) + N 2_^ P Us'(i)n(DUA/'„) AA/2„)liAr2„<i*A72„ 

X2„SijV2„ 

= -Np^"'\jsi,)) + Np^"'\jsij)) = 0 (7.10) 
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It follows from (7.9) and (7.10) that the 2PS component of , i.e. 


e J 

is the mle of the local two-hop conditional likelihood. We therefore have 

§v,2PS ^ 


2,PS 


7.4 Proof of Theorem 5.1 

To prove Theorem 5.1, we need two preliminary results. 

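Lemma 7.1 Let $\theta^{v,*} = (\theta^*)^{PS_v}$ be the true value of the parameter for the conditional
model of $X_v$ given $X_{\mathcal{N}_v}$ and let $\hat\theta^{PS_v}$ be the value of $\theta^{PS_v}$ that maximizes $l^{PS_v}$. Then,
for $t_{J^{PS_v}}$ as in (5.2), if there exists $\epsilon > 0$ such that

$$\|t_{J^{PS_v}} - (k^{PS_v})'(\theta^{v,*})\|_\infty \le \epsilon \le \frac{C_{\min}^2}{10\, D_{\max}\, d_v}, \qquad (7.11)$$

then

$$\|\hat\theta^{PS_v} - \theta^{v,*}\|_F \le \frac{5 \sqrt{d_v}\, \epsilon}{C_{\min}}. \qquad (7.12)$$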


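Proof. To simplify our notation in this proof, we drop any subscripts and superscripts
containing $v$ or PS, except when it is necessary to keep them to make the argument clear.
Let $Q(\Delta) = l(\theta^*) - l(\theta^* + \Delta)$. Clearly $Q(0) = 0$ and $Q(\hat\Delta) \le Q(0) = 0$, where
$\hat\Delta = \hat\theta - \theta^*$. Let $\|\Delta\|_F = \sqrt{\sum_{j \in J^{PS_v}} \Delta_j^2}$ denote the Frobenius norm of $\Delta$. Define
$\mathcal{C}(\delta) = \{\Delta \mid \|\Delta\|_F = \delta\}$. Since $Q(\Delta)$ is a convex function of $\Delta$, if we can prove

$$\inf_{\Delta \in \mathcal{C}(\delta)} Q(\Delta) > 0, \qquad (7.13)$$

then, by convexity of $Q$, it will follow that $\hat\Delta$ must lie within the sphere defined by $\mathcal{C}(\delta)$,
i.e. $\|\hat\Delta\|_F \le \delta$. We are now going to prove that there exists $\delta > 0$ such that on $\mathcal{C}(\delta)$,
$Q(\Delta) > 0$. For $\Delta \in \mathcal{C}(\delta)$, we have

$$Q(\Delta) = l(\theta^*) - l(\theta^* + \Delta) = \theta^{*t} t - k(\theta^*) - ((\theta^* + \Delta)^t t - k(\theta^* + \Delta))$$

$$= k(\theta^* + \Delta) - k(\theta^*) - \Delta^t t = \Delta^t k'(\theta^*) + \tfrac{1}{2} \Delta^t k''(\theta^* + \alpha\Delta) \Delta - \Delta^t t, \quad \alpha \in [0, 1]$$

$$= \underbrace{\Delta^t [k'(\theta^*) - t]}_{Q_1} + \underbrace{\tfrac{1}{2} \Delta^t k''(\theta^* + \alpha\Delta) \Delta}_{Q_2}. \qquad (7.14)$$

By the Cauchy-Schwarz inequality, we have the following bound for $Q_1$:

$$|Q_1| = |\Delta^t [k'(\theta^*) - t]| \le \|k'(\theta^*) - t\|_\infty \|\Delta\|_1 \le \epsilon \sqrt{d}\, \|\Delta\|_F = \epsilon \sqrt{d}\, \delta. \qquad (7.15)$$

For $Q_2$, we have

$$Q_2 \ge \tfrac{1}{2} \|\Delta\|_F^2 \min_{\alpha \in [0,1]} \lambda_{\min} k''(\theta^* + \alpha\Delta) = \tfrac{1}{2} \delta^2 \min_{\alpha \in [0,1]} \lambda_{\min} k''(\theta^* + \alpha\Delta). \qquad (7.16)$$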
We now want to bound the term q = min^gjo^] Xmin[k''{9*+aA)] from below. Following 
(5.4), we can write Zy^{9 + aA) = + ^^ 3 ) fAv-vi ^^nI) 1 then we can rewrite 

the entries of H in (5.5) as 

expzj.^ (6>*+aA) _/_ exp2:fc„ (e*+Q:A) _x2 -C U = 1 

-L fyA = } l+^!/«sl„\{O}exp^:,„(0*+aA) 1 l+Ej,„6/„\{o} exp^fc„ (0*+aA) 1 ^ ^ 

9 k,I \ expzii^{ 9 *+aA)expzi^{d*+aA) 


then 


(l+Ey„6/„\{o}exp2:a„(e*+aA))2 ’ 

dVkJi^* + 


if ki) ^ 


da 


Y1 (9k:i)'yA^* + 


yv&lv\{0} 


(n) N, dZy^ 

da 


/ nv\' //I* A (n)\ . i i i • • 

where \ )y^ [9 + a A, xj^j) = —^^ - —. ft is easy to see that these derivatives can 

all be expressed in terms of probabilities of the type (5.3) and that they are always less 
than 1 in absolute value. Therefore, since ~ '^jeJ-vesij) 


I {S*+aA,x‘'"l) 

I dcx. ' 


< Zly„e/A{ 0 } “ ^y.e/Afo} ^ieJ;i;e 5 (i)( 7 . 17 ) 
= 'l2jej-vesU) Sy.„e/„\{o} fjiVv^ ^ W^) . 


The Taylor series expansion of %’i{9* + aA,a;^^) yields 


Vk,i ^n!) = VkJ «' e [0, a] . 




Let K{9* + a'A, xj^j) denote the dy x dy matrix with entry 
to (7.16), we have 


. Coming back 


k"{9* + aA) = jfEti[Hi9* + aA,x^;;i)o[W'^{W-Y]] 

ji Eh, h{»’,AI) »|W'”(ir")‘] + q7 


= 1 Eh, »|W'”(ir")‘] + q 7 Eh, k(«' + “A, ii;>)»[iv"(w.-")‘|. 
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We write ||W ||2 = Amax(^) for fhe operator norm of a matrix X. We recall that the 
Hadamard product of two positive semidehnite matrices is positive semidehnite and there¬ 
fore 


Xmin{K{9* + a'A,a;S^) o [W^{W^y]) > + aA,xPj o [W"(W")*]). 

Then since |q!| < 1, we have 

■max„p[o,i] \ \aj^[Y,n=iK{9* + aA,x^y^l) o 1 12 


1 


N 


> C„.ta - II - Y, 


(7.18) 


n=l 


> Cmin - maXa 6 [ 0 ,l] 11 A| I 2 , 


A 


where the last but one inequality is due to our Assumption (B). We now need to bound 
the spectral norm of A = A A*W"'(iy"'(W"')*). For any a G [0,1] and y G with 
lll/lliT’ = 1, we have 


N 


N 


{y,Ay) = ^ |A‘lV"|(9‘ir‘)'' 


N 


n=l 


N 


n=l 


\A^W^\ ^ Vd\\A\\F = Vd6 . 

and, by dehnition of the operator norm and from Assumption (B), 


(7.19) 


N 


N 


N 




n=l 


< II — 
- "iV 


^W^{W 


n\t I 


< 


(7.20) 


n=l 


From (7.18), (7.19) and (7.20), we obtain max^gjo^] ||A ||2 < Dmax^/~d5 and therefore 

Q A Cuiin DiYiax~'Aii5 . 

Substituting this into (7.16), we get 


Q 2 A -^5‘^{Cmin — Dmax^/d5). 


(7.21) 
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From the two inequalities (7.15) and (7.21), it follows that 

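$$Q(\Delta) \ge Q_2 - |Q_1| \ge \tfrac{1}{2} \delta^2 (C_{\min} - D_{\max} \sqrt{d}\, \delta) - \epsilon \sqrt{d}\, \delta. \qquad (7.22)$$

To simplify the problem, we can choose $\delta$ such that $C_{\min} - D_{\max} \sqrt{d}\, \delta \ge \frac{C_{\min}}{2}$, that is,
$\delta \le \frac{C_{\min}}{2 D_{\max} \sqrt{d}}$. Then inequality (7.22) becomes

$$Q(\Delta) \ge \frac{C_{\min}}{4} \delta^2 - \epsilon \sqrt{d}\, \delta$$

and $Q(\Delta)$ is positive if we let $\delta = \frac{5 \sqrt{d}\, \epsilon}{C_{\min}}$. Moreover, $\delta \le \frac{C_{\min}}{2 D_{\max} \sqrt{d}}$ yields the following bound
on $\epsilon$:

$$\epsilon \le \frac{C_{\min}^2}{10\, D_{\max}\, d}.$$

We have therefore shown that (7.13) holds for $\delta = \frac{5 \sqrt{d}\, \epsilon}{C_{\min}}$ and the lemma is proved. □

In the next lemma, we will make use of the Hoeffding inequality (see Hoeffding (1963), Theorem
2), which states the following. If $X_1, X_2, \ldots, X_n$ are independent and $a_i \le X_i \le b_i$ $(i =
1, 2, \ldots, n)$, then for $\epsilon > 0$

$$P(|\bar{X} - \mu| \ge \epsilon) \le 2 \exp\Big(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big).$$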

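Lemma 7.2 Let $t_{J^{PS_v}}$, $k^{PS_v}$ and $\theta^{v,*}$ be as defined above. For any $\epsilon > 0$, we have

$$p\big(\|t_{J^{PS_v}} - (k^{PS_v})'(\theta^{v,*})\|_\infty > \epsilon\big) \le 2 d_v \exp\Big(-\frac{N \epsilon^2}{2}\Big) \qquad (7.23)$$

and we also have

$$p\Big(\max_{v \in V} \|t_{J^{PS_v}} - (k^{PS_v})'(\theta^{v,*})\|_\infty > \epsilon\Big) \le 2 \sum_{v \in V} d_v \exp\Big(-\frac{N \epsilon^2}{2}\Big). \qquad (7.24)$$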

Proof. In this proof, as in the previous proof, we drop the superscripts f, PS except 
when necessary. For j G , we clearly have 




e* 


di{e) 

de. 


— Eo* 11 


dk{9)\ 
' 86, ) 


= EA^, 


N 

= jvlxlJfjiXv = jv, xPj 


n=l 
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and \fj{x^\x^^^) — p{xy = ~ Moreover, by Hoeffding inequal¬ 

ity, 

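$$P(|t_j - k_j'(\theta^*)| > \epsilon) \le 2 \exp\Big(-\frac{2 N^2 \epsilon^2}{2^2 N}\Big) = 2 \exp\Big(-\frac{N \epsilon^2}{2}\Big).$$

Applying a union bound, we get, for each $v \in V$,

$$p\big(\|t_{J^{PS_v}} - k'(\theta^*)\|_\infty > \epsilon\big) \le \sum_{j=1}^{d_v} p\big(|t_j - k_j'(\theta^*)| > \epsilon\big) \le 2 d_v \exp\Big(-\frac{N \epsilon^2}{2}\Big),$$

from which it follows that

$$p\Big(\max_{v \in V} \|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty > \epsilon\Big) \le \sum_{v \in V} p\big(\|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty > \epsilon\big) \le 2 \sum_{v \in V} d_v \exp\Big(-\frac{N \epsilon^2}{2}\Big).$$

□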


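Proof of Theorem 5.1. Let $\epsilon = C \sqrt{\frac{\log p}{N}}$ where $C$ is a constant that we will
choose later in this proof. From Lemma 7.2, we have

$$p\Big(\max_{v \in V} \|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty > C \sqrt{\tfrac{\log p}{N}}\Big) \le 2 \sum_{v \in V} d_v \exp\Big(-\frac{N \big(C \sqrt{\log p / N}\big)^2}{2}\Big) = \frac{2 \sum_{v \in V} d_v}{p^{C^2/2}}.$$

From Lemma 7.1, for $\epsilon = C \sqrt{\frac{\log p}{N}} \le \frac{C_{\min}^2}{10\, D_{\max}\, d_v}$, i.e. for $N \ge \big(\frac{10\, C\, D_{\max}\, d_v}{C_{\min}^2}\big)^2 \log p$, we have

$$\|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty \le \epsilon \le \frac{C_{\min}^2}{10\, D_{\max}\, d_v} \implies \|\hat\theta^{PS_v} - \theta^{v,*}\|_F \le \frac{5 \sqrt{d_v}\, \epsilon}{C_{\min}}.$$

The global composite mle $\hat\theta$ obtained by the local averaging of the $\hat\theta^{PS_v}$ from each
conditional model can then be bounded as follows:

$$\|\hat\theta - \theta^*\|_F \le \sqrt{\sum_{v \in V} \Big(\frac{5 \sqrt{d_v}\, \epsilon}{C_{\min}}\Big)^2} = \frac{5C}{C_{\min}} \sqrt{\frac{\sum_{v \in V} d_v \log p}{N}}.$$

Therefore, under the condition $N \ge \max_{v \in V} \big(\frac{10\, C\, D_{\max}\, d_v}{C_{\min}^2}\big)^2 \log p$, the following holds:

$$\max_{v \in V} \|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty \le C \sqrt{\tfrac{\log p}{N}}
\implies \|\hat\theta^{PS_v} - \theta^{v,*}\|_F \le \frac{5 \sqrt{d_v}\, \epsilon}{C_{\min}} \text{ for all } v \in V
\implies \|\hat\theta - \theta^*\|_F \le \frac{5C}{C_{\min}} \sqrt{\frac{\sum_{v \in V} d_v \log p}{N}}$$

with

$$p\Big(\|\hat\theta - \theta^*\|_F \le \frac{5C}{C_{\min}} \sqrt{\tfrac{\sum_{v \in V} d_v \log p}{N}}\Big) \ge p\Big(\max_{v \in V} \|t_{J^{PS_v}} - (k^{PS_v})'(\theta^*)\|_\infty \le C \sqrt{\tfrac{\log p}{N}}\Big) \ge 1 - \frac{2 \sum_{v \in V} d_v}{p^{C^2/2}}.$$

The theorem would make no sense if the lower bound on this probability were negative
and thus $C$ must satisfy

$$1 - \frac{2 \sum_{v \in V} d_v}{p^{C^2/2}} > 0 \iff C > \sqrt{\frac{2 \log\big(2 \sum_{v \in V} d_v\big)}{\log p}}.$$

□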
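
The quantities appearing in this proof can be tabulated with a short sketch. It assumes the forms used above: the threshold $C > \sqrt{2 \log(2 \sum_v d_v) / \log p}$, the sample-size requirement $N \ge \max_v (10\, C\, D_{\max} d_v / C_{\min}^2)^2 \log p$, the bound $\frac{5C}{C_{\min}} \sqrt{\sum_v d_v \log p / N}$ and the probability $1 - 2 \sum_v d_v / p^{C^2/2}$. The function names and all numerical inputs are hypothetical.

```python
import math

def smallest_admissible_c(p, d):
    """Threshold that C must exceed so that 1 - 2*sum(d)/p**(C**2/2) is positive."""
    return math.sqrt(2.0 * math.log(2.0 * sum(d)) / math.log(p))

def theorem_5_1_quantities(n, p, d, c, c_min, d_max):
    """Return (epsilon, required N, error bound, probability lower bound).

    d is the list of local dimensions d_v; c_min and d_max play the roles of
    C_min and D_max in assumptions (B) and (A).
    """
    eps = c * math.sqrt(math.log(p) / n)
    n_required = max((10.0 * c * d_max * dv / c_min ** 2) ** 2 for dv in d) * math.log(p)
    bound = 5.0 * c / c_min * math.sqrt(sum(d) * math.log(p) / n)
    prob_lower = 1.0 - 2.0 * sum(d) / p ** (c ** 2 / 2.0)
    return eps, n_required, bound, prob_lower

# Hypothetical inputs: 50 vertices with 5 local parameters each.
d_example = [5] * 50
c0 = smallest_admissible_c(50, d_example)
quantities = theorem_5_1_quantities(10 ** 6, 50, d_example, c0 + 1.0, 0.5, 1.0)
```

Taking $C$ just above the threshold gives a probability bound close to zero, so in practice a larger $C$ trades a looser error bound and a larger required $N$ for a more useful probability.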
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